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Abstract 



We consider a broad class of nonlinear statistical inverse problems from a Bayesian 
perspective. This provides a flexible and interpretable framework for their analysis, 
but it is important to understand the relationship between the chosen Bayesian model 
and the resulting solution, especially in the ill-posed case where in the absence of prior 
information the solution is not unique. 

Following earlier work about consistency of the posterior distribution of the re- 
construction, we obtain approximations to the posterior distribution in the form of a 
Bernstein-von Mises theorem for nonregular Bayesian models. Emission tomography 
is taken as a canonical example for study, but our results hold for a wider class of 
generalised linear models with constraints. 

Some key words: Bayesian inference, Bernstein-von Mises theorem, inverse problems, 
nonregular likelihood, SPECT, tomography,total variation distance. 

1 Introduction 

Inverse problems are almost ubiquitous in applied science and technology, and because of the 
need for rigorous analysis to characterise such problems, derive numerical solutions and assess 
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their performance, not to mention intrinsic mathematical interest, they have long been the 
subject of intense mathematical study. In the corresponding 'direct problem', (macroscopic, 
global) observational data are predicted from the (microscopic, local) model parameters of 
the system. In the inverse problem conclusions about model parameters are inferred from 
data. 

This paper is a contribution to the theory of inverse problems from a Bayesian perspec- 
tive. Motivated by important problems in tomographic reconstruction, taken as a canonical 
example, we consider asymptotic properties of Bayesian procedures in the small- noise limit, 
for a class of models that we call generalised linear inverse problems. 



1.1 Inverse problems from a Bayesian perspective 

Inverse problems encountered in nature are commonly ill-posed: their solutions fail to satisfy 
at least one of the three desiderata of existing, being unique, and being stable. Thus, the 
focus is not on a unique solution x G W of 

A(x) = y\ (1) 

for given function A and data vector y* G R n , but rather on the corresponding set of solutions. 
Even when the solution a; to (JTJ) exists and is unique for each possible y*, lack of stability 
means that the solution can be extremely sensitive to small errors, either in the observations 
or in computations. To circumvent this, the inverse problem is typically regularised, that is, 
re-formulated to include additional criteria, such as smoothness of the solution: 

x* = argmin_ 4(:c)= ^pen(x), 

where pen(x) is a suitable scalar penalty function. If the inverse problem is ill-posed, the 
regularised solution x* may differ from the actual value x true that generated the data y* = 

•^(•^truc) • 

If the data is observed with error, for example if we observe y modelled as a random 
variable with p.d.f. or p.m.f. p(y | y*) then, allowing for the possibility of lack of existence 
or uniqueness, the likelihood is penalised, and a commonly considered solution of the inverse 
problem is that maximising the penalised likelihood, that is, 

x = argmin^, [— \ogp(y \ A(x)) + Apen(x)] , (2) 

with A a positive constant controlling the trade-off between accuracy and smoothness. 

Penalised least squares was one of the first approaches of this kind in inverse problems. 
While often natural, this corresponds to a Gaussian likelihood, which may not always be 



appropriate. For instance, Dupe et al. (2011) study inverse problems where the observations 



are counts and where a Poisson likelihood would have been more appropriate. 
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We now discuss the penalisation. Smoothness, or other 'regular' behaviour of the solution 
to an inverse problem, is a prior assumption on the unknown x, information about the model 
parameters known or assumed before the data are observed. To use such information thus 
accepts that the required solution must combine data with prior information. In a statistical 
context it is then natural to follow the Bayesian paradigm. 

From this perspective, the solution to (J2J) is immediately recognisable as the maximum a 
posteriori (MAP) estimate of x, the mode of its posterior distribution in a Bayesian model 
with likelihood p(y | A(x)), and in which the prior distribution of x has density proportional 
to exp{— Apen(x)}. 

However, the Bayesian perspective brings more than a different characterisation of a 
familiar numerical solution. Formulating a statistical inverse problem as one of inference 
in a Bayesian model has great appeal, notably for what this brings in terms of coherence, 
the interpretability of regularisation penalties, the integration of all uncertainties, and the 
principled way in which the set-up can be elaborated to encompass broader features of the 
context, such as measurement error, indirect observation, etc. The Bayesian formulation 
comes close to the way that most scientists intuitively regard the inferential task, and in 
principle allows the free use of subject knowledge in probabilistic model building (see, for 
instance, Rover et al. (2007) and Davis et al. (1995)). For an interesting philosophical view 



on inverse problems, falsification, and the role of Bayesian argument, see Tarantola (2006) 

Mathematical analysis of nonlinear inverse problem (TTJ is typically far more difficult 
and technical than for the linear case A(x) = Ax. However, a modest generalisation is 
enough to formulate and analyse a broad range of nonlinear statistical inverse problems of 
considerable practical importance. The model class we consider - that of generalised linear 
inverse problems - is formally defined in Section |3j 

1.2 Convergence of the posterior distribution 

The main mathematical focus in inverse problems concerns how well the true solution can 
be recovered in the presence of noise, as the size of that noise goes to zero. In the case of a 
Bayesian analysis, the focus is on the small-variance asymptotic behaviour of the posterior 
distribution of x. 

For a differentiable identifiable likelihood and prior distribution positive and continuous 
at the "true" value of the parameter, the posterior distribution is asymptotically Gaussian in 
the case where the "true" parameter is an interior point of the parameter space. This result 



is known as the Bernstein-von Mises theorem, van der Vaart (1998) gives a total variation 



distance version of the theorem, adapted from Le Cam (1953) and Le Cam and Yang (1990) 



under mild additional assumptions on the error model. The theorem implies that, under 
the above conditions, the prior has no asymptotic influence on the posterior, that posterior 
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inference is consistent and efficient in the frequentist sense, and that posterior credible regions 
are asymptotically the same as frequentist ones. 

However, for our motivating example of the Poisson inverse problem, and, more generally 
for the class of models we consider, the assumptions of the theorem do not hold. Firstly, the 
assumption of identifiability of the likelihood may not hold if the inverse problem is ill-posed. 
Secondly, the assumption that the "true" value of the parameter is interior to the parameter 
space may not be satisfied. For the tomography example, the unknown parameter is the 
vector of image intensities, which are nonnegative and can be zero, corresponding to holes 
in an organ for example. 

Investigating the literature showed that there has been little study of the asymptotics 
of the posterior distribution for so called nonregular models, i.e. models where the above 
assumptions are not satisfied. Theoretical foundations for the study of the models where 
the Bernstein-von Mises theorem's assumption of the existence of the first derivative of the 
loglikelihood is violated were laid by Ibragimov and Has'minskij (1981) for the case of a one- 
dimensional parameter. These authors considered two types of densities, those with jumps 
and those with singularities. They gave expressions for distributions approximating the pos- 
terior, differing from the Gaussian in both cases, and for the rates of contraction of the poste- 
rior distribution (and hence for the correct order rescaling of the parameter) that also differs 
from the 1 / y/n obtained under the regular assumptions. There were further developments 



in this area by Ghosal and Samanta (1995), Ghosh et al. (1994), and Ghosal et al. (1995) 



these extend the results of Ibragimov and Has'minskij (1981) to i.i.d. models with a regu- 
lar nuisance parameter (IGhosal and Samanta 1995]) . where the joint approximating distri- 
bution asymptotically factorises into the approximating distribution for the "nonregular" 
one-dimensional parameter and a Gaussian distribution for the regular nuisance parameter. 
For densities with jumps, when the limit exists it is a shifted exponential distribution for 
the recentred "nonregular" parameter, with the rate of contraction 1/n; for densities with 
singularities the limiting distribution is more complex. 

Under the conditions of Ibragimov and Has'minskij (1981), Ghosal et al. (1995) proved 
the existence of the limit of the posterior for the appropriately centred and rescaled p- 



dimensional parameter, without specifying the limit explicitly in a general setting. Ghosh et al. (1994) 



characterised the limit of the posterior distribution (for i.i.d. observations) in the particular 
case where the posterior distribution can be "properly centred" . Such a setting applies to the 



regular case, to densities with jumps or singularities, as considered in Ibragimov and Has'minskij (1981' 



who showed that for densities with jumps the limit of the posterior does not always exist. 



Chernozhukov and Hong (2004) considered a class of nonlinear regression models with addi- 



tive errors, where the error density has a jump at 0, that arise in econometric applications. 
The authors showed that the limit of the appropriately rescaled posterior distribution was a 
product of shifted multivariate exponential distributions for the "nonregular" parameter and 
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a Gaussian distribution for the "regular" nuisance parameter of the distribution of errors. 
A particular case of this model where the error density has support on the non-negative 



semiline was considered by Hirano and Porter (2003) 



In this paper, we extend the Bernstein-von Mises theorem in two directions, by relaxing 
both the assumption of identifiability of the likelihood and the assumption that the "true" 
value of the parameter is interior to the parameter space. We consider a broad class of prob- 
ability distributions for the data, generalised linear inverse problems, allowing the likelihood 
to be unidentifiable, and a broad class of prior distributions. We allow linear constraints on 
the solution of the inverse problem and allow the solution of the exact linear inverse problem 
to be on the boundary. 

We will show that for these models a consequence of relaxing these two assumptions is 
that the limit of the posterior distribution, as well as the rate of convergence, depend on the 
choice of the prior distribution and that the limiting distribution is a product of Gaussian and 
exponential in different directions. We identify the directions in parameter space where the 
posterior distribution contracts at different rates. We also show how to derive approximations 
for Bayesian estimators for a given loss function, how to study asymptotic distribution of 
functionals of the parameter, and how these can be used in practice. 

We motivate our study by presenting in Section [2] a nonlinear inverse problem important 
in medical imaging, and Section [3] establishes the class of models we study. In Section H] we 
study geometry of the parameter space determined by the posterior distribution, with an 
illustration for the linear inverse problems with Gaussian likelihood and a Gaussian prior. 
In Section [5] we study local behaviour of the posterior distribution in a neighbourhood of the 
limit that is formulated as a version of Bernstein-von Mises theorem that is illustrated on 
the motivating example in Section El We conclude with a discussion. All proofs are deferred 
to the Appendix. 



2 Motivation 

In this section we consider an important example motivating the class of models studied in 
this paper, generalised linear inverse problems. 

2.1 Single photon emission computed tomography 

Single photon emission computed tomography (SPECT) is a medical imaging technique in 
which a radioactively-labelled substance, known to concentrate in the tissue to be imaged, 
is introduced into the subject. Emitted particles are detected in a device called a gamma 
camera, forming an array of counts. Tomographic reconstruction is the process of inferring 
the spatial pattern of concentration of the radioactive isotope in the tissue from these counts. 
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The Poisson linear model 

yt ~ Poisson(yl t a;) (3) 

independently for different t, is close to reality for the SPECT problem (there are some dead- 
time effects and other artifacts in recording). Here x represents the spatial distribution of the 
isotope, typically discretised on a grid, x = {x s }, and y the array of detected photons, also 
discretised y = {y t } by the recording process. The array A = (a ts ) quantifies the emission, 
transmission, attenuation, decay and recording process; at s is the mean number of photons 
recorded at t per unit concentration at pixel/voxel s. 



See Green (1990) for further detail about the model, and an approach based on EM esti- 
mation for MAP reconstruction of x, in a Bayesian formulation in which spatial smoothness 
of the solution is promoted by using a pairwise difference Markov random field prior. Later, 



Weir (1997) investigated fully Bayesian reconstruction. 

Since Poisson distributions form an exponential family, this model can be seen as a 
generalised linear model ( INelder and Wedderburn 19721) . with identity link function, and 
since A is ill-posed we can call this a generalised linear inverse problem. 

We formalise the notion of small-noise limit for this Poisson model in a practically-relevant 
way, by supposing that the exposure time for photon detection is extended by a factor T, and 
then consider the rate of detection of photons, letting T — > oo. Thus the data-generation 
model becomes 

TY t \x t ruc ~ PoisSOn(TA t X trU c), 

independently, for t = 1,2, ... ,n. 
2.2 Prior distributions 

From the beginning of Bayesian image analysis (IGeman and Geman 1984j Besag 1986), use 



has been made of prior distributions for image scenes that express generic, qualitative beliefs 
about smoothness, yet do not rule out abrupt changes for real discontinuities (for example, 
at tissue type boundaries in the case of medical imaging). 

In common with much of the literature, we will concentrate here on Markov random field 
prior distributions. The 'true image' x trU e in emission tomography corresponds to a physical 
reality, the discretised spatial distribution of concentration of a radioactive isotope. Of course, 
this is non-negative, so we impose the constraints x > (interpreted componentwise), written 
x E X = [0, oo) p C W. 

The first prior model we consider is Gaussian, apart from possible truncation by the 
constraint, 

p(x) oc exp|-^||x-;ro||ij , xeX, 
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where \\u\\ 2 B = u T Bu and B is a non-negative definite matrix. An important special case is 
where xq = and B satisfies B ss > — 1 if s and s' are neighbouring pixels (written s ~ s'), 
otherwise B ss i = 0. Then we have \\x — Xq\\ 2 b = Yl s ~s'( x s ~ x s >) 2 , a pairwise-interaction 
model. In this and other important cases B is singular. 

A second prior model is a log cosh pairwise-interaction Markov random field (IGreen 1 990): 

p{x) cxexp [ Vlogcosh((x s -a; s /)/5) | , x G X. (4) 

Here the parameter 5 is considered to be fixed. 

This model has some attractive properties. While giving less penalty to large abrupt 
changes in x compared to the Gaussian, it remains log-concave. It bridges the extremes 
5 — > oo, the Gaussian model just mentioned, and 5 = 0, the corresponding Laplace pairwise- 
interaction model, sometimes called the 'median prior'. 

These distributions are improper since they are invariant to perturbing x by an arbitrary 
additive constant, but lead to proper posterior distributions, save in exceptional pathological 
circumstances. 



3 Model formulation 

3.1 Generalised linear inverse problems 

Motivated by the emission tomography example, we formulate a general class of inverse 
problems with similar properties that we call generalised linear inverse problems (GLIP). 

We assume that the joint density of the observable responses Y taking values in y c M. n 
(with respect to Lebesgue or counting measure) can be written 

p(y\x) = F(y,Ax,T) = C y<T exp^-^f y (Ax)Y y ey (5) 

for some n x p matrix A. The key feature of these models is that the distribution depends 
on x G X only via Ax, where r is a scalar dispersion parameter; in the Gaussian model, r is 
the variance a 2 . The observed data are generated from this distribution, with x = x true , and 
we aim to recover x trU e as r — > 0. 

We assume a continuous bijective link function G : y MJ 1 and write G(y*) = Ax tvuc . 
(In generalised linear models - see Example 3 below - commonly G has identical component 
functions.) 

We make the following assumptions about the error distribution: 

1. If Y ~ F(y, G(jjlq), t), then Y ^ /i as r -> 0, for all // G G~\AX). 
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2. For all /i G G 1 (AX), f^ {ji) nas a unique minimum over AX at r\ = G(fi ). 

Assumption 1 states that r is not only the dispersion parameter in the model but also that 
the distribution of Y contracts to its expected value as r — > 0. Assumption 2 establishes 
identifiability of the likelihood with respect to the linear predictor 77 = Ax. It is sufficient 
to assume that these conditions hold for fiQ = y* where y* is the "exact" data defined in the 
Introduction. 

A particular case of such models is a linear inverse problem with independent observations, 
where all A is independent of n and r = 1/n. 

For example, Assumption 1 is not satisfied for the Cauchy distribution (or indeed any 
distribution with polynomial decay and with scale depending on r) since the density cannot 
be cast in the form fl5]) for any choice of r. Assumption 1 is satisfied for the power exponential 
(Subbotin) distributions F(y,fi,a) = C a ^ exp{ — [{y — fi) 2 } 13 ^ 2 /a 13 } ((3 > 0), with r = a 13 and 

Assumption 1 is satisfied by generalised linear models. 



Example 1. In the generalised linear models of N elder and Wedderburn (1972), an impor- 
tant class of nonlinear statistical regression problems, responses yt, t = 1, 2, . . . , n are drawn 
independently from a one-parameter exponential family of distributions in canonical form, 
with density or probability function 

PKVt, lh,T) = exp I h d(y t , r 

using the mean parameterisation, for appropriate functions h, c and d characterising the 
particular distribution family. The parameter t is a common dispersion parameter shared by 
all responses. The expectation of this distribution is E(?/ t ;/i t ,r) = fi t = c' (fit) /b' (fit)- Both 
assumptions are satisfied for this example. 

The tomography example given in Section [2] belongs to this class of models, with r = T _1 , 
b(fit) = log fit, c(fi t ) = fit, fit = A t x and X = [0, oo) p . 

As the link function G is continuous and monotonic, we could consider a linear inverse 
problem Ax = y* where y* = G(y*), Y = G(Y) and y = G(y). Hence, to simplify the 
notation, we assume below that the link function is the identity. 

3.2 Bayesian formulation of GLIP 

We adopt a Bayesian paradigm, using a prior distribution with density given by 

p (x) oc exp(— g(x)/ r y 2 ), x^X, (6) 
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where 7 2 is a scalar dispersion parameter for the prior, that may depend functionally on 
r; we relate this to the data dispersion parameter r by 7 2 = r/u, and express most of our 
results below in terms of r and v. Thus the posterior distribution satisfies 

p(x\y) ex exp(-[f y (Ax) + is g(x)}/r), xeX, (7) 

where f y (Ax) was defined by ([5]). 

Denote f y (x) = f y (Ax) and h y (x) = f y (x) + v g[x\ so that p (x\ y) oc e~ hy ^ x ^ T . 

We will assume throughout this paper that X = [0, oo) p . We could assume that the 
parameter x is restricted to an arbitrary convex polyhedron; this could be reduced to [0, oo) p 
by a linear change of variables, and indeed some of the ideas we discuss would hold true for 
more general subsets of MP. 

We shall also assume that y* is either an interior or a lower boundary point of AX. Oth- 
erwise, if y*j is an upper boundary point, one can replace Aj t with — A,-, for the corresponding 
j. We also assume that matrix A has no zero rows or columns. 

We shall use the default norms = ||z||2 for both vectors and matrices. 

The limiting statements are given in terms of a — sfr. 

4 Geometrical perspective 

In this paper, we study inference for x given observed y, in the limit as a noise parameter 
r = a 2 (in the SPECT example, 1/T) goes to 0. We generally assume an identity link 
function, so that y becomes concentrated on Ax tr ue as a 2 — > 0. 

Because of the ill-posed/ill-conditioned character of the problem, we cannot expect con- 
sistency in inference about x trU e based on the likelihood alone. Even as a 2 — > 0, so that y 
converges to 'exact data' y* = Ax tIU e, we will not be able to distinguish between {x : Ax = 

One of the roles of the prior in the Bayesian approach is to resolve this ambiguity (as 
well as generally improve reconstruction through 'regularisation', even without a 2 — > 0). We 
recall the 'physical' constraint in the SPECT problem, that x is componentwise non-negative, 
that is, x 6 X, since it quantifies the isotope concentration. 

Insight into the interplay between the possibly ill-posed likelihood and the possibly de- 
generate prior, and the role of the constraint x G X can be obtained from a geometrical view 
of the problem. 

4.1 Gaussian likelihood and prior 

In this section, we focus on the Gaussian prior p(x) oc exp(—l / (2^ 2 )\\x — xqW^) and Gaussian 
likelihood y\x ~ N{Ax, a 2 I). 
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In the limit as a 2 — > 0, we are interested in solutions of Ax = y*, where y* = Ax tTue , under 
the influence of the prior p(x) oc exp(— l/(27 2 )||x — a^o 1 1 s) • To obtain convergence to a degen- 
erate limit, we will need 7 2 — > as well (though, as shown by Hofinger and Pikkarainen (2007) 
for the case B = I, at a slower rate than a 2 ). 

Thus the posterior is proportional to 

exp(-l/(2a 2 )\\y - Ax\\ 2 - l/(2^ 2 )||x - x \\%) subject to x E X. 

Let us first ignore any constraint on x. By standard manipulations, we can write this posterior 

as 



x 



y ~ M ((A T A + vB)- 1 (A T y + uBx ), a 2 (A T A + vB)' 1 ) 



assuming the inverse matrix exists. A rank condition is needed to ensure this, so that the 
information from the likelihood and prior together define a proper posterior. 

Proposition 1. Suppose that A is a real nx p matrix, and B a real symmetric non-negative 
definite pxp matrix, both possibly of deficient rank. Suppose also that the px2p block matrix 
[B : A T A] has full rank p (or equivalently , the rows are linearly independent) . Then for all 
v > 0, A T A + vB is nonsingular. 

It follows that there exists a nonsingular real matrix P, not necessarily orthogonal, such 
that P T BP, P T A T AP, and P T (A T A + uB)P (for all v > 0) are all diagonal. 

Furthermore, there exist well-defined finite non-negative definite matrices C and D with 
ranks p — q and q respectively, where q = rank(A), such that v(A T A + vB)~ l = C ' + Du + o{v) 
as v — t- 0. 

The last part of the proposition gives us a full description of the posterior variance 
matrix as a 2 — > 0, 7 2 — > while v = a 2 j^ 2 — > 0. In summary, the posterior distribution is 
Gaussian, with variance scaling differently in different directions. If q is the rank of A, then 
asymptotically the variance has q eigenvalues scaling like a 2 and the remaining (p — q) like 
the (larger) 7 2 . Geometrically, contours of equal posterior density are concentric ellipsoids 
in W p . 

As a 2 and 7 2 — )■ in such a way that v = o 2 j^ 2 — > 0, the posterior converges to the 
point 

x* = &rgmm xeX:Ax=y *\\x - x \\ 2 B , (9) 
a point that is uniquely determined under the conditions of Proposition 1. 



4.2 Constrained Gaussian model and KKT theory 

When X is a proper subset of MP, the ellipsoidal contours are truncated by the constraints 
x G X. In the case of interest in SPECT, where we have simply componentwise non- 
negativity constraints, the ellipsoids are truncated into the non- negative orthant. As a 2 
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and 7 2 become small, there are clear qualitative differences in the impact of this truncation 
according to whether the centre (A T A + uB)~ l (A T y + vBxq) of the ellipsoids lies in the 
interior of the orthant, on its boundary, or outside it. 

Equation fl9]) is a quadratic programming problem, and could be solved numerically by 
standard software. 




Figure 1: Illustrating the geometry in the case p = 2, n = 1, with B = I. Contours of 
posterior when 7 2 > a 2 > 0. 

We can get a theoretical handle on the solution through Karush-Kuhn-Tucker theory 
( IKuhn and Tucker 1951[) . In the non-negativity constrained case, X = [0, oo) p , to minimise 



\\x — xq\\ 2 b subject to x > and Ax = y* it is necessary and sufficient to find (a;*,/z, A) G 
W x W x W l such that 

B{x* -x ) -n + A T \ = 
x* > 0, Ax* =y\ ^ > 
for all s, n s = or x* = 

The feasible set X* = {x G X : Ax = y*} is closed and convex, and x* may be an interior 
point, or satisfy one or more of the constraints x s = 0. 

In the case where all entries of A are non- negative (in accordance with physical reality), 
and for each s there is at least one t with A ts > (and if not, then x s is unidentifiable, 
so might as well be omitted from the model), X* is a bounded polyhedron (or polytope). 
Otherwise, X* may be unbounded. 

If 7 2 remains bounded away from as a 2 — > 0, then, in the limit, the posterior has 
support X*. 

If x* is an interior point of X*, there exists a neighbourhood of x* that lies inside X, and 
hence, on this neighbourhood, the posterior distribution is not truncated. In this case, as a 2 
and 7 2 — > 0, the posterior distribution behaves as in the unconstrained case. If x* lies on the 
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boundary of X*, there are two possibilities: either the unconstrained minimum is achieved 
at x*, or outside X. In the first case it is easy to see that the posterior distribution of x 
recentred by x* is "half-Gaussian" (a multivariate Gaussian distribution centred at and 
truncated at where x* is on the boundary). Thus, in a neighbourhood of x*, the posterior 
distribution has similar concentration ellipsoids as in the unconstrained case, but truncated 
at x* in some directions (these directions will be defined precisely in Section POl) . However, 
in the second case, where the unconstrained solution to the optimisation problem lies outside 
X, for small a and 7, the posterior distribution no longer exhibits Gaussian behaviour in the 
directions orthogonal to the boundary (where x* is on the boundary). This is essentially a 
consequence of the tail behaviour of the Gaussian: if £ ~ A/"(0, 1) and x > 0, 

lim P(f > t + x/t I £ > t) = e~ x . 

t—>oo 

The precise formulation of the limit of the posterior distribution is given in the Section El in 
a more general case. 

4.3 Geometry in a general constrained case 

The form of (Q strongly suggests that analogous properties for the limit of the posterior 
should hold in a much broader class of models. Provided that o 2 — > and 7 2 — > in such a 
way that v = a 2 /'-/ 2 — > 0, we would expect similar limiting behaviour under the assumptions 
in Section El 

In a general setting, more delicate, analytic, arguments will be needed to quantify the 
convergence precisely. However, for regular problems, the broad qualitative features of the 
solution for the Gaussian-Gaussian case (Section HTTI) continue to hold: the posterior becomes 
increasingly concentrated near the hyperplane {x : Ax = y*}, with its variation about this 
hyperplane controlled by r, while the variance parallel to the hyperplane is of order j 2 . The 
effect of the truncation onto x E X depends on whether in the absence of the constraint, the 
maximum of the posterior would lie in the interior of X, on its boundary, or outside it. 

Now we describe the local geometry of the posterior distribution around the point x*, 
under the assumption that functions f y * and g are differentiable, relaxing the assumption 
that x* is an interior point by allowing it to lie on the boundary of X. Such a model is 
nonregular. 

Throughout, we use Vj = as the derivative operator, and V = (Vi, . . . , V P ) T as the 
gradient. Similarly, and Vijk are operators of the second and third derivatives, with 
V 2 = (Vjj) being the matrix of second derivatives. 

In the limit of zero noise, the Bayesian analysis has solved two optimisation problems: 

X* = arg min f v *(x), 

xe[0,oo)P,Ax=y* 

x* = argrmng(£). (10) 
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We assume that the prior distribution is such that x* is a unique solution. Denoting rj = Ax, 
the first problem can be reformulated as follows: 



y* = ar s^^f y *(v)- (11) 

This condition is the identifiability of the likelihood with respect to rj = Ax. The second 
expression is the definition of x*, the point where the posterior distribution concentrates, 
which depends on the prior distribution. 

Now we use the Karush-Kuhn-Tucker (KKT) theory to study the local geometry of the 
solution. If the solution of the optimisation problem x* is an interior point of X* = {x G X : 
Ax = y*}, then 

= (j^9& +V- Pat)z)\ z=0 J = (/ - P A r)Vg(x*), (12) 

where P a t is the projection on the range of A. However, if x* is on the boundary, the 
gradient Vg(x*) may not be zero. This corresponds to the maximum in the unconstrained 
case lying outside X. In this case, the KKT conditions are: 

Vjy* (Ax*) > & [Ax*]j Vjf y * (Ax*) = 0, j = 1, . . . , n, (13) 
Vg(x*)=A T \ + ( & Qx* = 0, i = l,...,p (14) 

for some A G W 1 and C G [0, oo) p . 

Define the sets of the nonregular boundary components of y* and of x* by 

S = {i G 1,2, . . . ,p : Q > 0}, Z = {j G 1,2, . . . ,n : V J y *(Ax*) > 0}, 
S* = {i G 1, 2, . . . ,p : x* = 0}, 7: {./( 1.2 //: [Ax% = 0}. 

By the KKT conditions, Z C Z* and S C S*. If S ^ S* or Z ^ Z*, the corresponding 
minimum is achieved on the boundary and the gradient is zero. 

In the small noise limit, we will show that the posterior distribution exhibits different 
types of behaviour on 4 subsets Wo, Wi, W2, W3 of W such that X — x* = {z = x — x*, x G 
X} = Wo © Wi © W2 © W3. These four subsets are determined by p x matrices 14 of 
rank p fe : W fe = {EJii[^fc]i « e for fc = 0, 1 and W fc = {E^i^li,^, « e } for 
k = 2,3, and are their dimensions, where the matrices Vt satisfy the following conditions: 

A Z V = 0, AVi = 0, AV 3 = 
VqA^c AzcVo and V^V\ are positive definite, 
Vi T ( = 0, V3 T C is a vector with positive coordinates ((s G M+'), 

V^V f y *(x*) is a vector with positive coordinates. (15) 
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where Vq and V\ are the matrices of the largest size satisfying the conditions. These conditions 
imply that 

Po = rank(A) — rank^^,), Pi = p — rank(A T : Vg(x*)), (16) 
P2 = rank(A^), p 3 = rank(A T : Vg(x*)) — rank(A). 

Note that p 3 can be either or 1. The four subsets can be characterised as follows: 

a) Wo: likelihood is identifiable, projection of Vf y *(x*) on Wo is 0; 

b) Wi: likelihood is not identifiable, projection of S7g(x*) on Wi is 0; 

c) cone W2: likelihood is identifiable, projection of Vf y *(x*) on W2 is nonzero (projection 
of x* on W2 is on the boundary of W2); 

d) cone W3: likelihood is not identifiable, projection of Vg(x*) on W3 is nonzero (projec- 
tion of x* on W3 is on the boundary of W3). 

The four matrices Vk define a transform from x to w = (wq, wj , w 2 , w^) T G W 0+Pl x W+ +Pa : 

3 



x = x 

k=0 



Lemma 1. // (Vk) satisfy conditions (II 5p . then the matrix V = (Vq : V\ : V 2 : V3) has full 
rank. 

Now we propose a way to construct the matrices V and V~ x . 

Definition 1. Define Z 2 to be a subset of Z such that rank(Az) = rank(Az 2 ) = \Z 2 \, so 
that for every j G Z , Aj can be written as a linear combination of vectors {Aj, j G Z 2 ) with 
nonnegative coefficients: Az t = cnAz 2: where a G [0, oo)' z ' x ' Z2 ' . 

Define Zq C Z c such that (A^, j G Zq) are linearly independent and (Aj y , j G Z2 U Zq) 
are linearly independent. In particular, Po = \Zq\ = rank(A) — \Z 2 \. 



The subset Z 2 exists by Caratheodory's theorem (p. 37 of Bertsekas (2006)). 
Here is one way of constructing the matrix U = V~ x . Let V\ G M pxpi satisfy 

(A T : Vgix^fV, = 0, V?V X = I pi , (18) 
rank(Vi) = p — rank(A T : Vg(x*)) 

such that the matrix [Vijs^ is of highest possible rank. Take U = (Uq : lf[ : U 2 '■ UJV with 

U = A Zo „ U l = V^ U 2 = A Z2 „ U 3 = ( T . (19) 
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In Proposition H] below, we give the explicit expression for V = U -1 , and study the image 
of X — x* under the transformation U, rescaled as follows. 

Define the following scaling transform S = S ari : X — x* — > W 0+Pl x M+ +P3 : 

S(x-x*) = D-iV- 1 (x-x*), (20) 

where D aa = diag(cr/ po , jl pi , o~ 2 I P2 , 1 2 Ip z ) an d V — (Vo : V\ : V 2 : V3) is defined by (jT5l) . 
This corresponds to rescaling each of the four subsets independently. As we shall see in 
Theorem [TJ this is the appropriate scaling for the posterior distribution. 

Proposition 2. Let the matrix U be defined by (1T91) with V\ satisfying (1TB]) . Then, 

1) the matrix V = U^ 1 satisfies conditions (iTBT) . 

2) the matrices Vo, V 2 and V3 are defined by 

V 3 = C7iiCii 2 , V k = (I - P m JA T z JA z JI - P m ,c)A T Zk ,r\ (21) 

for (k,m) G {(0,2), (2,0)}, with projections on the range of A T Zii (Pq), on the range of 
(I - Pq)A t Z2j (P 2fl ) and on the range of (I - P<;)A T Zky (P k J: 

P = P a t o , P 2 , — (I — P )A T Z JA Z2 XI - P )A T Z J- 1 A Z2 XI - P ), 

P< = CC T /IICH 2 , Pu,c = (/ - P C )A T Zk \A Zk Xl - Pd^z k ]- l A Zk Xl ~ P f ) 

C=(/-P 2)0 )(/-Po)C 

6/ Pit = 0, i/ien £/ie projection matrix that uses the corresponding U k is zero); 

3) if Z = Z* and S = S* and \S\ = p 2 + P3, the linear transform S = D^V^ 1 maps 
X - x* onto Rpo+pi x ^ 2+P3 under conditions fl28|) .- 

4) more generally, under conditions ( |28l) . £/ie linear transform S = D~ly~ l maps X — x* 
onto V* C M P0+Pl x W? +P3 defined by 

_ J {(v ,vi,v 2 ,v 3 ) : [V k ] Sk v k > 0, k = 0,1,3} 1/ c = 0, 

\ {(v , «i, «2, u 3 ) = c[V k 3 v + [Valsos.va > & [VL] Sl «i > 0} if c> 0, 

where the inequalities are component-wise, S03 = So U S3 and 
S-j = {£ E S* : ± 0}, 

S 3 = {ieS*: [Vfo = & [V 3 ] e 0}, (22) 
So = {£ G S* : [V^, = & [V& = & [VoU, ? 0}. 

In particular, \Sq\ < \Zq R Z*| and Si aas at most s = |S*| — P2 ~ P3 constraints. 
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In Section [5] we show that the posterior distribution has different asymptotic behaviour 
in these four sets of directions. Now we shall look at some examples. 

Example 2. Consider the Poisson likelihood with identity link: Y/a 2 ~ Pois(Ax/o~ 2 ) (n = 1, 
p = 2), with A = (1, 1). We take x true = (1, 0) T , so that y* = Ax trne = 1. The linear inverse 
problem Ax = y* , i.e. x\ + X2 = 1, subject to constraints xi, x^ > 0, is ill-posed. To resolve 
the ambiguity, we use the penalty \ \x — x \\ 2! with two different x . 

1. xq = (4, 2) T . Then x* = (1, 0) T . In this case, p 2 = since for r) — y* — 1, V v fy*{vi) = 
— 1/77 + 1 = 0. Thus, po = rank(A) = 1, and we take Uq = A = (1, 1). The null space 
of A is {a(l, — 1) T , a 6 1R}. The gradient of the negative log prior at x* (up to a factor 
l/'y 2 ) is —(3, 2) T = —3 A T + ( where ( = (0, 1) T . The gradient is not orthogonal to the 
null space, hence p% = 1 and thus p\ = 0, with U3 = (0, 1) T . The corresponding Vq and 
V3 are Vo = (1, 0) T and V3 = (—1, 1) T . The conditions of item 3) in Proposition^ are 
satisfied hence the image of X — x* under S in the limit isKx M + . 

2. Xq = (3,3) T . Then x* = (0.5,0.5) T . Since y* is unchanged, we again have pi = 0. We 
can take the same Uq = (1, 1) T . The gradient of g at x* here is x* — xq = —2.5 (1, 1) T 
and is orthogonal to the null space of A, {a(— 1, l) T ,a 6 M}. Therefore, we have V\ = 

= \/0.5(— 1, 1) T . Since the kernel is one-dimensional, p% = 0. Here Vq = 0.5(1, 1) T . 
Here we are again under conditions of item 3) in Proposition^ hence V* = M. 2 . 

Further examples can be found in Sections 15.41 and 16.11 



5 Analogue of the Bernstein— von Mises theorem 
5.1 Assumptions on the likelihood and the prior 

In addition to assuming that we have a GLIP model with r = a 2 , we make the four main 
assumptions that the posterior distribution is proper, that the log likelihood and log prior 
density have bounded third order derivatives with respect to x, that the log likelihood and 
its first two derivatives are continuous with respect to y, and that the posterior distribution 
is concentrated in a neighbourhood of x* . 
Assumption P. 

We assume that the prior distribution is such that the posterior distribution is proper: 

3a > : Vtr ^ <r , / e~ hv{x)/a '' dx < 00 forP y * almost all y E y. 
Assumption S (smoothness in x). 
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There exist 5 k > 0, k = 0, 1, 2, 3, such that there exist uniformly bounded derivatives up to 
third order: 3/jJ m) , 3g^ on B s (x*) for F y * almost all y G }>, m = 1, 2, 3, and 3C /3 , C g3 < oo 
such that for all x G B s (x*) : with probability — >■ 1 as a — >■ and all 1 ^ k ^ p, 

iVijfc/^x)! ^ C /3 , |Vi^(x)| < C 93 , (23) 

where B 5 (x*) = {x G X : - a;*) G 54 and 5 5 = B 2 {0,S Q ) x £ 2 (0,5i) x 5^(0, S 2 ) x 

Soo(0,5 3 ). 

Assumption C (continuity in y). 

The derivatives of fy(x*) converge to the corresponding derivatives of f y *(x*) with prob- 
ability — > 1 as a — > 0, i.e. for all 1 ^ j 1: . . . ,j d ^ p with d = 0, 1, 2, 

V,. ,,./v(.<-)- V, : ;,/„.(.»") X) as a X). (24) 

These assumptions are satisfied if 3V^ / W (x) for d — 1,2,3 and these derivatives are 
bounded for P^* almost all fi £ 3^- 
Assumption L. 

For 8 k > defined in Assumption S, 

P(A (5) ->■ 0) ->■ 1 as a ->• 0, (25) 

where 

A (<J) = (t-^-^t-p 1- ^ 3 / e -(M*)-M**))/- 2 ^ ^6) 

JX\B s {x*) 

where p fc is the non-degenerate dimension of U k (X — x*), k = 0, . . . , 3. For k — and k = 2, 
Pk = Pk- This assumption implies that for small a, the posterior distribution is concentrated 
on Bs(x*). 

5.2 Notation 

The limiting behaviour of the posterior distribution is characterised by the matrices of second 
derivatives: 

V y {x) = V 2 f y (Ax), B(x) = V 2 g(x), 
H y (x) = V 2 h y (x) = A T V y (x)A + uB(x), 
^oo = VZ'V 2 fa(x*)V = VZ'A T V u >(x*)AV , 

and by the following projections of the gradients: 

a = V 2 T Vf r (x*), ciy = V 2 T {Vf y (x*) + vVg(x*)\, 

b = V 3 T Vg(x*) : (27) 



17 



where H 00 = V^H y (x*)V , and By = V^BVj, i,j = 0,1,2,3. By the conditions (TT5|) . a is 
a vector with positive coordinates, b > if P3 > 0, and f2oo and 5n are positive definite. 
For U and V considered in Proposition |2l 6 = 1 if p 3 > 0. The smallest components of the 
vectors a y and a will be denoted by a y ^ n = mhij a y ^ and a m j n = min, respectively. 



5.3 The Bernstein— von Mises theorem 

In the theorem below, which is an analogue of the Bernstein-von Mises theorem, we show 
that the posterior distribution converges to a finite limit, after the rescaling and the change of 
variables defined in Section 14.31 This can be used to approximate the posterior distribution 
in practice, for small values of a, and to study asymptotic properties of Bayes estimates. 

Theorem 1. Consider the Bayesian GLIP model defined in Section^ such that matrix A 
has no zero rows or columns, and let Assumptions P, S, C and L on f y and g stated in 
Section [3T71 hold. Assume also that, as a — > 0, 

- 0, 7^0, c = lim < oo. (28) 

For Vk, k = 0, . . . , 3 ; satisfying conditions ( TT5l) . we assume that Bqq — Bq\B^ B\q is 
positive semi-definite, and that the following limit exists for allu: 



a (uj) = fioo 1 



lim>-V T V/y M (:r*)] + cV T Vg(x* 

(j — 



Define a random probability measure on V* = lim^o S{X — x*) C Rpo+P 1 x M^ 2+P3 ; 

fj*(u) = {Af P0 (a (u),n J) x Af Pl (O,^ 1 ) x Exp p2 (a) x Exp p3 (6)} l v *, 

where transform S is defined by f[2TJ|) . Exp m (v) is the distribution of an m- dimensional vector 
£ with independent coordinates ^ ~ Exp(vi), and ly* is the indicator function of set V*. If 
support V* degenerates to a manifold of a smaller dimension, then fi*(u)) is normalised be a 
probability measure on this manifold, and it has mass 1 on the degenerate part of MP. 
Then, as a — > 0, 

Pre 

I \^S(x-x*)\Y — f*\ I TV 0. 



If pk = then the corresponding factor in the definition of fi* disappears. In particular, if 
x* is an interior point, the limit is Gaussian distribution with no constraint on its support. If 
the likelihood is also identifiable, the statement is the classical Bernstein-von Mises theorem. 

If x* or Ax* is on the boundary of its corresponding parameter space, then there is at 
least one direction where the posterior distribution converges faster than the convergence 
around an interior point. If the plane Ax = y* does not intersect the boundary at the right 
angles, i.e. if the positivity constraints on the coordinates imply nontrivial constraints after 
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the change of variables, it is possible to have degeneration of the support of the posterior 
distribution of S(x — x*) due to mutually exclusive constraints (see Examples H] and EJ) . 

Assumption of finiteness of a (u) implies that V^f (V f Y /u)(x*) — V f y *(x*)) not only con- 
verges to zero (Assumption C) but also that it converges at rate a or faster. This holds in 
the case where Yj are independent and their distribution belongs to the exponential family, 
since f Y (r)) = ~ YTj=i b(.Vj) a (Yj) ~ Y^j=i c (Vj)^ an d ^s variance is proportional to a 2 . 

The assumption of the existence and boundness of the third derivatives of f y and g can be 
relaxed. It is sufficient to assume that the supremum of the absolute value of each component 
of V T V 2 f y (x*)V , V^V 2 g(x*)Vi, V 2 T V f y (x*) and V^Vg(x*) on B s (x*) converges to zero as 
5 —7- at the appropriate rate, with probability 1. 

We will also state a nonasymptotic bound on the distance between the posterior distri- 
bution of the rescaled parameter and its limit. 

Proposition 3. Take 5k > 0, k = 0, . . . , 3, satisfying the following conditions 

max[0.5z/||5oo||//^,cr||a (u;)||] < So < min[0.2A min (fi o)/^, I \U x*\ |], 
<Ji < mm[\ mill (B n )/n B , x k < ||#jfeZ*IU, k = 2,3, 

0.5v\\V 2 T Vg(x*)\\ < max [5 k c 2 . k ] < 0.2a min /3, 

fc=0,l,2 

max [S k c 3jk ] < 6 min /4, 

fc=0,l,2,3 

where constants c mk are defined in the proof, and the inequality 5 > cr || a o( w )|| holds with 
high probability. 

Define the following events 



Ai{6) = {u: \W\yf Y ( u ){x*)-Vfa(x*)]\\ 00 <M 1 mwL6 k }, 

k 

A 2 (5) = {lu: \\V T [V 2 f Y(w) (x*)-V 2 f y *(x*)]V \\ <M 2 5 }, 

A 3 = {co: \\n 00 a (u;) - cV T Vg(x*) - a^VfV 7i»«)|| < p) , 

for some p — > as a — > and positive constants Mi and M 2 . 

Then, under the assumptions of Theorem^ on A = A\ D A 2 D .43, 

\Ws(x-x*)\y -H \\tv < 2p Q m&x<C 8 , 1 -T I - ' T J | 

+ 2pimax jCidi, 1 - T I — 2 \ — 

+ 2p 2 max {C 2 5 2 , exp{-a min 5 2 /a 2 }} 
+ 2p 3 max {C 3 5 3 , exp{-65 3 /7 2 }} 

+ C B {C 4 [5 /7 + 5 2 h + 5 3 / 7 ] 1511 + C 5 [5 2 /7 2 l ' 
+ C A A (5), 
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where pk is the dimension of the non-degenerate part of U}~(X — x*), k = 1,3, and the 
constants are defined explicitly in the proof 

The upper bound implies that for the total variation to be small in practical applications, 
the dimensions pk should not be too large compared to the corresponding rate, and that the 
smallest values of the parameters a m i n , b and the smallest eigenvalues of the precision matrices 
fioo an d Bn cannot be too small, namely cannot be smaller than the corresponding rate. 

It is interesting to note that the smallest 5k satisfying the local constraints (given in the 
first four lines of the upper bound) coincide with an upper bound on the Ky Fan distance 
between the posterior distribution and its limit 8 X * on the corresponding subspace/cone of the 
parameter space (IBochkina 20120 . Thus, it appears that the Ky Fan distance determines the 
radius of the largest ball centred at x* where the concentration of the posterior distribution 
can take place. 

5.4 Examples of degenerate support 

It follows from Theorem [1] and Proposition [2j that in some cases the limit of the posterior 
distribution of S(x — x*) can degenerate due to degeneration of its support. 

In a first example we consider the degeneracy where the solution of the unconstrained 
optimisation problem coincides with that of the constrained problem and occurs on on the 
boundary of the parameter space. 

Example 3. Consider the Gaussian likelihood with the identity link: Y ~ N(Ax, a 2 ) (n = 1), 
with A = (1,1). We take x truc = (0.8,0.2) T , so that y* = Ax tIU e — 1- The linear inverse 
problem Ax = y* , i.e. X\ + X2 = 1, subject to constraint xi, x% > 0, is ill-posed. To 
resolve the ambiguity, we use the penalty \\x — Xolh? with xq = (3,2) T . Then the solution 
to the constrained optimisation problem is x* = (1,0) T ; the same as the solution to the 
unconstrained problem. 

The gradient Vf y *(Ax*) = — y* /(Ax*) + 1 = implies W2 is empty and p<i = 0. The 
gradient of g at x* is x* — xq = — (2,2) T = — 2 A T + ( where ( = (0,0) T . Thus, W3 is also 
empty even though x* occurs on the boundary. 

Therefore, we have that Wi and Wo are nonempty, with U = (1, 1) and U\ = = 



v / 0.5(— 1, 1) T , since the kernel of A is {a(l, — l) T ,a G R}. The corresponding Vq is Vq = 

0.5(1, 1) T . By statement 4) in Proposition^ the image of X — x* under S is K x M +; since 
the only constraint is [ViJsrUi = yO^Evx > 0, i.e. V\ > 0. Hence, //* is normal distribution 
N(Z(u), 1) x JV(0,1) truncated io V* = I x t + where Z(u) = (Y(co) - I) /a ~ jV(0,l). 

Another case where the support can degenerate is where the kernel of A intersects the 
constrained parameter space at a single point. We give two examples. 
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Example 4. Consider a linear inverse problem with A = (1,1) and x true = (0,0) T . Then 
y* = Axtme = 0. Under the Poisson error model with the identity link, the true distribution 
of data is degenerate: F trne (Y = 0) = 1. Even though ill-posed, the linear inverse problem 
Ax = 0, i.e. Xi + X2 = 1, subject to constraints x±, Xi > 0, has a solution x* = (0, 0) T that 
is unique due to the constraints, for any penalty. 

Since y* = and V f y *(Ax*) = 1, we have Z = {1} and thus U 2 = (1, 1) and po = 0. 
Take penalty \\x — x \\ 2 with x = (a, (3) T G (0, oo) 2 . The kernel of A is {a(l, — 1) T , a G K} ; 
Vg(x*) = -x . 

If a = (3, (l,-l)Vg{x*) = and hence V x = = y/0l(l,-l) T and p 3 = 0. By 
Proposition^ V2 = 0.5(1, 1) T and the constraints are \/0.5vi > and —y/0.hv% > due to 
Si = {1,2}, that imply that v\ must be zero. 

If a ^ (3, then pi = and U 3 = ((/3 — a) + , (a — /?)+). Taking (3 = 1 and a = 2, we 
have U3 = (0,1), V2 = (1, 0) T and V3 = (— 1, 1) T . By statement 4) of Proposition^ the 
constraints are —v 3 > and v 3 > 0, implying v 3 = 0. 

Hence, in both cases, the limit of the posterior distribution of S(x — x*) is Exp(l) for the 
first component, and the second component is with probability 1 with V* = M + x {0}. 

Now we consider a higher-dimensional example of this phenomenon. 

Example 5. Take A = and x tmc = (0,0, 1,1) T . Then y* = Ax true = (2,0) T . 

The likelihood is Poisson with the identity link: Yijo 2 ~ Pois(Aix/o~ 2 ). The linear inverse 
problem Ax = y*, i.e. x\ + X2 = and X^i=i x * = 2, subject to constraints Xi > OVi, is 
ill-posed. The constraints are equivalent to x% + X4 = 2, x 3 , X4 > 0, X\ = X2 = 0. To resolve 
the ambiguity, we use the penalty \\x — Xo\\ 2 with x$ = (1, 3, 4, 1) T . Then x* = (0, 0, 2, 0) T . 

To construct U 2 , find V r/i /j / *(r/) = —y*/r]i + 1, thus V v f y *(y*) = (0, 1) T and Z = {2}. 
Thus, U 2 = (1,1,0,0) and U = (1,1,1,1). 

The null space of A is {a(l, —1, 0, 0) T + (3(0, 0, 1, — 1) T , a, (3 G M}. The gradient of g at 
x* , x* — Xq = (—1, —3, —2, — 1) T , is orthogonal to a(l, —1, 0, 0) T + /3(0, 0, 1, — 1) T - a direction 
in the null space of A - if (3 = 2a. Thus, V\ = \/0.1(l, —1, 2, — 2) T is the direction ofW\. A 
vector ( that satisfies (|T4"I) is ( = (2, 0, 0, 1) T implying S = {1, 4}, and U 3 = (2, 0, 0, 1). This 
implies that U\ = \/0.1(l, —1, 2, —2) and 

H = 0.2(-1,1,3,2) T , \/ 2 = 0.1(3,7,-4,-6) T , V 3 = 0.2(2, -2, -1, 1) T . 

Now we check the constraints using Proposition^ We have s = \S*\ — p 2 — p 3 = 1, and the 
constraints are [Vi]eVi > for £ G S\ = S* = {1, 2, 4}, i.e. we must have V\ = 0. 

Hence, fx* = Af(y/2Z(u) - Ac, 2) x 5 {0} x Exp(l) x Exp(l) truncated toV* = Rx {0} x R 2 + 
where Z(uf) = lim CT ^. (^i(w))/v / 2 — 1) ~ A/"(0, 1) in probability. 
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5.5 Approximate Bayes estimators 

Now we apply the approximation of the posterior distribution stated in Theorem [1] to 
approximate the distribution of Bayes estimators. We use a similar approach to that of 



Chernozhukov and Hong (2004) , by approximating the distribution of recentered and rescaled 



Bayes estimates with the distribution of the corresponding Bayes estimates obtained under 



the limiting distribution, for a wide class of loss functions. The approach of Chernozhukov and Hong (2004) 



relies on Theorem 1.10.2 in Ibragimov and Has'minskij (1981) that implies this result from 



an analogue of Theorem [TJ under some additional conditions. 

Assumption Q. We consider loss functions Q : W — > [0, oo) satisfying the following 
properties: 

1. Q(z) > 0, Q(z) = if and only if z = 0, and Q is convex; 

2. Q(z) is dominated by a polynomial in ||z||oo as ||^||qo — * °°- 

This loss function is applicable to the modified parameter v = S(x — x*), and the corre- 
sponding Bayes estimate of v is 



v Q = arg inf / Q(v - v')p S (x-x*)\ y (v')dv' , 

ves(x-x*) J S ( X _ X *) 

where Ps{x-x*)\y is the density of the posterior distribution of S(x — x*) with respect to 
Lebesgue measure. Since S is a bijection, the corresponding Bayes estimate of x is 



xq = arg inf / Q(S(x — x'))p(x' \ y)dx' , 
xexj x 



and vq = S(xq — x*). Thus, the corresponding loss function for x is Q(x) = Q(Sx). 

In the next theorem we state the asymptotic distribution of xq as a — > 0. If c = 
lim^o ° I J 2 — and there is no degeneracy of the support (the conditions of item 3) in 
Proposition |2] are satisfied), then V* factorises to V* = ®| =0 Vfe where 14 C MP k for k — 0, 1 
and Vfc C MJ± for k = 2,3. If c > and there is no degeneracy, then the factorisation is 
V03 x Vi x V2, up to a permutation of the coordinates, since in this case the scales of Vo and V3, 
a and 7 2 , are of the same order and there are joint constraints on vq and V3 (Proposition |2]). 
To simplify the statement, let be the map from X — x* to V£ (So = (tU , Si = jUi, 
S 2 = cr 2 U 2 , 1S3 = 7 2 t/ 3 ) and fi* k be the marginal distribution on V^. If there is no degeneracy 
of the support, the marginal distribution is Gaussian on Vo and Vi and exponential on V2 
and V3 (Theorem [1]). 

Theorem 2. Suppose that conditions of TheoremU\hold, and that the loss function Q satisfies 
Assumption Q. Then, for xq defined above, 
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(1) S(xq — x*) — > Vq as o — > 0, where 

Vq^) = ar § inf / Q(v - v')dn*(v', u), 
vev* J v , 

and fi*(-,u) is the limit of the rescaled posterior distribution defined in Theorem^ 

(2) if Q(v) = Y.l=oQk( v k) and V* = <S) k=0 V k where v = (v ,v 1 ,v 2 ,vA , v k G V k , k = 
0, 1, 2, 3, then 

S k (x Q - x*) A Vg k (uj) = arg inf / Q k (v k - v' k )di4{v' k , u)dv' k . 

■"k^-Vk J Vk 

Here A denotes convergence in distribution. 

This theorem establishes consistency, rates of convergence, and limit distributions of 
Bayes estimates. 

For example, for the quadratic loss function Q(z) = [ )^| ||, Vg k is the mean of the 
corresponding limiting distribution, i.e. Vq (uj) = a (oj), Vq 1 = 0, Vq 2 is the vector 
(1/ai, . . . , l/a P2 ) and Vq 3 is the vector (1/fei, • • • , l/b Pz ). To obtain the MAP estimate, we 
take Q(v) = Y^=i H\ v i\ — e )/ e _ an approximation of the Dirac delta function that satisfies 
the above assumption, obtain the corresponding Bayes estimate and take the limit e — > 0, 
independently of a. Thus, it can be shown that the rescaled and recentred MAP estimate 
S(xs — x*) converges in distribution to the mode of the density of fi*(uj). 

6 The Bernstein— von Mises theorem for SPECT 
6.1 Approximation of the posterior distribution 

Consider the SPECT model defined in Section [2j and allow some coordinates of y* to be 
zero. This model is nonregular, since F y -*. (Yj = 0) = 1 for j G Z; hence, with probability 
1, Vjfy(Ax*) = 1 = Vjf y *(Ax*), and, since we assume that matrix A has no zero rows or 
columns, 

n 

j-y*j¥=o j=i jez 

The nonregularity arises from the elements where there is no data (y* ■ = 0) but, since Aj ^ 0, 
it gives us information about those Xi where Aji ^ 0. 

The assumptions of Theorem [1] are satisfied since the derivatives of the log posterior are 
bounded up to order 3 (Assumption S), Assumption C holds due to convergence in probability 
Vjfy(Ax*) = 1 — Yj/y*^ — > = Vjf y *(Ax*) as o — >■ for j Z, and the convergence 
assumption is true for the second order derivatives and the functions as well. Assumption 
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L is satisfied since both Poisson likelihood and the log cosh prior have exponential tails for 
large values of their arguments, the appropriate rescaling is either of the same order as 
(leading to the integral over the complement of a ball with radius tending to infinity) or 
smaller (leading to an integral with a factor tending to zero), hence the integral outside of 
the ball vanishes. 

The matrices U and V determining the change of variables can be taken as given by ffTP]) 
and in Proposition [2] respectively. Then, if p 3 > 0, b = 1 and the parameter of the other 
exponential distribution is given by a = A T Z l z = a y . 

For the log cosh prior with density defined by 01]), the prior precision matrix B(x) has 
the following non-zero entries: 

Note that B is singular, as the prior is improper (x is a priori identified only up an ad- 
ditive constant). The matrix [A T A : B] is of full rank since the null space of B, {v = 
a(l, . . . , 1) T , a G M}, is one- dimensional and A(l, . . . , 1) T 7^ since A does not have zero 
rows, and all of its elements are nonnegative. The precision matrices are given by 

B n = V? B(ar*)Vi, ft 00 = V T A T z dmg(l/[y*} z )A z y . 

a = n m W T A T z ^ + cn^VfVgix*) ~ M (cQ^V^g{x*), n,, 1 ) , 

where ^ ~ A/"(0, 1) for j e Z c is the limit of ^/^(Yj/y^j — l)/cr as a — » 0, ^ are independent 
due to independence of Yj, and ^ = (^1, . . . , £ n ) T - The random variable ao is centred at 
zero (and hence the posterior distribution of Uq(x — x*)/cr is not affected by the prior) if 
c = lim CT ^o S = 0. 



6.2 Ill-posed-ness and unidentifiability 



The results in Bochkina (2012) prove that, under a subset of our stated conditions, the 
posterior degenerates to the point x* in the small- variance limit. The proof assumes that 
the data model p(y\x) is correctly specified; in subsequent work we intend to relax this. 
However, in respect of the other key model component - the prior for x - no such assumption 
of correctness is made either in Bochkina (2012) or the present paper. The prior is regarded 
as the invention of the analyst, who, of course, has no knowledge of x trU e, the true value of 
x. Yet the point x* does depend on the prior, and so we need to understand the impact of 
the prior on the difference between x trU e and x*. 

The point x* is defined (Section 14. 31) as the point maximising the prior density p(x) 
subject to the non-negativity constraints x G [0, oo) p and the constraint that the model 
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fits the exact data: Ax* = y* = Ax true . Thus x* agrees with x true perfectly in directions 
orthogonal to the model hyperplane Ax = y*, i.e. P a t(x* — x tT uc) = 0. 

In contrast, parallel to this hyperplane, there is no information in the data about Xt rue) 
so x* is determined solely by the prior; there is no reason for (I — P^t)(x* — x trnc ) to be 
small. For example, for the Gaussian prior p(x) oc exp( — l/(27 2 )||a; — #o||b)> discussed in 
Section H~T} if the unconstrained maximum lim v ^.o(A T A + vB)~ 1 (A T Ax tmc + vBxq) satisfies 
the non-negativity constraints, then x* is given explicitly by this expression, and so equals 
Xtrue if and only if (J - P a t)x = (I - P A T)x tTUC . 



6.3 Practical implications of the approximate posterior 

In this section, we briefly discuss some practical implications of Theorem [TJ On a realistic 
scale where p is of the order 2 12 -2 16 we cannot hope to construct and manipulate p x p 
matrices such as V; this seems to rule out any practical use. However, there are well- 
developed methods for SPECT reconstruction using our model, using Markov chain Monte 
Carlo computation, delivering not only approximate, simulation-consistent, posterior means, 
but also variances; see Weir (1997) In this context, the theorem provides valuable knowledge 
which can enrich the interpretation of numerical results, enabling approximate probabilistic 
inference. 

Inferential questions of real interest, including (a) quantitative inference about amounts 
of radio-labelled tracer within specified regions of interest, or (b) tests for significance of 
apparent hot- or cold-spots, can be answered using approximate posterior distributions for 
linear combinations X T x of elements of x, and are particularly amenable to treatment in this 
way. More specifically, if for any non-empty set of pixels R C {1, 2, . . . ,p}, a R denotes the 
vector with elements a R = 1/\R\ for j 6 R, otherwise, then to deal with case (a) we can 
take A = a R to deliver \ T x as the average concentration of tracer in region R, and for case 
(b) take A = a Rl — a Ri to give the difference in average concentration in region Ri compared 
to i?2- To avoid bias in such inferences arising from the lack of identifiability caused by 
ill-posed-ness, it is important to check that A lies in the row space of A. 

The approach we propose would exploit analyis of MCMC output of X T x, together with 
a numerical MAP estimate x = aigmeLX x p(x\y) , calculated using an EM-based algorithm 
( IGreen 199011 . This can be used to identify x* and hence S, thus partitioning elements of 
x into those that are asymptotically normally or exponentially distributed; we anticipate 
Po ^ Pi,P2,P3 in practical situations. It will commonly be the case that j S for all j 
such that Xj ^ 0, in which case the theorem tells us that \ T x is asymptotically normally 
distributed, and its mean and variance, computed by MCMC, can be directly used to specify 
the approximating normal distribution and answer the inferential question posed. In the 
contrary case where Xj ^ for some j G S, a little work is needed to separate the asymptot- 
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ically normal and exponential components of x contributing to X T x. The simplest example 
of this would be a test for Xj = for a j such that Xj = and so j G S. For such j, Xj is 
asymptotically exponentially distributed a posteriori, with a parameter that can be readily 
estimated from the MCMC output. 



6.4 Finite sample performance 



10 20 30 40 



50 100 150 200 





Figure 2: First row: a sample from the posterior, contour map of acceptance rates, marginal 
posteriors for a section of 9 consecutive pixels through the image (row 12, columns 23 to 31), 
showing ground truth in blue. Second row: histogram of marginal posterior for a high-spot 
pixel (row 12, column 28), with truth indicated by blue line, and the same shown as a QQ 
plot against the normal and exponential distributions respectively. Third row, the same but 
for a low-spot pixel (row 12, column 31). 



Finally, we briefly discuss the extent to which the approximation in Theorem [T] holds 
true for data on the scale of a real SPECT study. A formal assessment of this would entail 
a major study beyond the scope of this paper, so instead we present selected results from 
an analysis of synthetic data based on a real SPECT scan of the pelvic region of a human 
subject. 



The matrix A was constructed according to the model in Green (1990) and Weir (1997) 



capturing geometry, attenuation and radioactive decay for a setup consisting of 64 projections 
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from a 2- dimensional slice through the patient, each projection yielding an array of 52 photon 
counts, corresponding to a spatial resolution of 0.57cm. Synthetic data was generated using 
this A and a 'ground truth' obtained from an approximate MAP reconstruction from real 
data. The total photon count was 62953; individual counts ranged from to 114, averaging 
18.9. Reconstruction was performed on a 48 x 48 square grid, with pixel size 0.64cm, using 
the log cosh prior with hyperparameters fixed at 7 = 25 and 5 = 8, was obtained using a 
simple MCMC sampler. We employed 20000 sweeps of a deterministic-raster-scan single- 
pixel random walk Metropolis sampler on a square-root scale for x, chosen to avoid extremes 
in acceptance rate at high- and low-spots in the image. 

Figure [2] shows selected aspects of this analysis; see caption for details. Our tentative 
conclusion from this is that the marginal posterior distributions for individual pixels Xj do 
appear to be approximately normal in high-spots and approximately exponential in low-spots, 
consistent with the theoretical limits presented in Theorem [TJ 



7 Discussion 

When the posterior distribution concentrates on the boundary, we have showed that the 
classic Bernstein-von Mises theorem, stating the limit of the posterior distribution recentred 
and rescaled by y/n for n independent random variables, does not hold. Instead, the limit 
differs in two respects, in directions towards the boundary: the limiting distribution is an 
exponential, and the appropriate scale is n, i.e. the convergence is faster. Parallel to the 
boundary, however, the classic version of Bernstein-von Mises theorem is applicable. Our 
results also extend the Bernstein-von Mises theorem to the case of non-iid observations and 
of a non-identifiable likelihood, for models that belong to the GLIP class. 



These are examples of nonregular problems, differing from those considered by Ghosal and Samanta (191 



Ghosh et al. (1994) and Chernozhukov and Hong (2004) In their case, the density of the er 



rors has a jump whose location depends on the unknown parameter; this is similar to a 
change point problem, whereas in our case there is a degeneration of the likelihood. This 
difference is reflected in the limiting distribution, that in the former case is shifted by a ran- 
dom variable that depends on data, whereas in our case there is no shift and no dependence 
of the exponential distribution on the data. 

The nonasymptotic version of the main result shows that other parameters of the model 
can also affect convergence in practice, such as the smallest eigenvalues of the precision ma- 
trices in the Gaussian part of the limit and the smallest parameter of each of the exponential 
distributions. 

There are interesting questions, beyond the scope of this paper, concerning the appropri- 
ateness of different prior formulations (as assessed from a frequentist perspective). Within a 
subjectivist Bayesian paradigm, real prior information is necessary for an informed choice. 



27 



An interesting direction for future work is to study both the behaviour of the posterior 
distribution, and the question of optimal prior specification, in a framework where the spatial 
resolution is infinitely refined, placing smoothness class constraints on x truc . 
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A Appendix: proofs. 



pi 



A.l Proofs, Section S] 

Proof of Proposition^ For arbitrary v > 0, suppose x G W is such that (A T A + vB)x = 
the zero vector in R p . We have to show that x = P . But (A T A + vB)x = P implies 
x T (A T A + vB)x = 0, and so by non-negative-definiteness of B and A T A, x T Bx = = 
x T A T Ax. But then Bx = P = A T Ax, and so x T [B : A T A] = 0^,. By the assumed full rank 
of this matrix, x must be 0„. 



Now fix Vq > 0. By Theorem 2 of Searle (1982), page 313, (with his A replaced by 
A T A + UqB), there exists a nonsingular real matrix P, not necessarily orthogonal, such 
that P T (A T A + u B)P = I and P T BP is the diagonal matrix A of the solutions for A to 
\B - \(A T A + vqB)\ = (which all satisfy < A < z^ 1 ). But then P T A T AP — I — u Q A and 
for any v > 0, P T (A T A + vB)P = I + (u — Uq)A, both of which are of course also diagonal. 

The matrix P can depend on the choice of u , but evidently always diagonalises A T A, 
B and any linear combination. Also, A depends on uq, but since \B — X(A T A + UqB)\ = 
(1 — \b>o) p \B — aA T A\ where a = A/(l — Az/o), the (diagonal) elements in A are Aj = 
ai/(l + ajZ/ ) where {ai} are the solutions to \B — a;y4 T y4| = (possibly some a, = +oo). 
So P T fiP = diag(oi/(l + Oii/ )), P T A T AP = diag(l/(l + a^o)) and for any i/, P T (A T A + 
vB)P = diag((l + anv)f(\ + a^o)). For the final assertions, note that v(A T A + vB)~ 1 = 
Pdiag(z/(1 + aiiVo)/(l + «jZ/))P T , which converges to Pdiag(5j)P T = C, say, where 5i = v$ if 
ai = +oo, and otherwise. Further, we can estimate the difference: u{A T A + uB)~ l — C = 
Pdiag(z/(1 + ajZ/ )/(l + a^u) — 5i)P T = i/Pdiag(0j)P r + 0(u 2 ), where 0j = if = +oo and 
otherwise 0, = 1 + a^o- 

Transformation by P scales and skews the result, but in a way independent of u, so 
the limiting behaviour of v(A T A + vB)~ l follows from the facts that the diagonal terms 
corresponding to aij = +oo have finite positive limits and the remaining ones scale as v. We 
see from P T A T AP = diag(l/(l + ct^o)) that the number of a« not equal to +oo is just the 
rank of A T A, i.e. rank(A). Thus u(A T A + vB)~ 1 = C + Dv + o[y) as v — > as required. □ 

Proof of Lemma [3 The matrix V is of full rank if 

3 

V k w k = 0, w e M. po , w l e W\ w 2 e R+ , u; 3 e R+ 

k=0 

implies w k = for all /c. 

Multiply the above expression by [V zfy*( x *)] T Az, we have that 

0=[V z f y *(x*)fA z y 2 w 2 

Since vector V 2 T ' A^V z fy* (x*) has positive coordinates and w 2 has nonnegative coordinates, 
the condition holds only if w 2 = 0. 
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Multiplying the above expression by V Q 7 A t z A z ^ and use w 2 = 0, we have that 

= V T A T Z A Zi [V w + V 2 w 2 ] = V T A T z A z V Q w 

Since matrix V 7 A t z A z Vq is of full rank, the above condition implies Wq = 0. 
Now we multiply the condition by Vg(x*) T and use w q = and w 2 = 0: 



3 



= Vg(x*) T V k w k = Vg{x*) T V lWl + Vg(x*) T V 3 w 3 = ( T V lWl + ( T V 3 w 3 = ( T V 3 w 3 . 



Since matrix [Vi]g c [K]s c , is positive definite, this condition implies W\ = 0. 

Thus, we showed that the p x p matrix V has p linearly independent columns, thus it is 
of full rank. 



Proof of Proposition Conditions (TT5|) state that p x p 1 matrix V\ consists of p\ linearly 
independent columns that satisfy AV\ = and ( T Vi = 0. If ( = (i.e. Vg(x*) is in the 
image space of A T ) then V\ consists of the p\ = p — rank(A) vectors that form a basis of the 
null space of A. If £ ^ 0, then ( is linearly independent of the columns of A and V\ consists 
of the p\ = p — rank(v4 T : () = p — rank(v4) — 1 vectors that form a basis of the null space of 



Therefore, if ( = 0, p 3 = 0, otherwise p 3 = 1 and V 3 is a vector in the null space of A 
that satisfies V 3 T ( > 0. 

1. Introduce the change of variables w = U(x — x*), where U = V^ 1 is defined in the 
statement of the lemma. Condition UV = I is equivalent to UkVk = I Pk and UkVj = for 
k^j, i.e. 



fc=0 




0, and use wq = 0, 



□ 



(A T : C) T - 



A Zo ,V = I Po , A Z2 V 2 = I P2 , C/iT4 = / Pl , ( T V 3 = 
AV 1 = 0, AV 3 = 0, A Zo V 2 = 0, A Zat V o = 0, 

( T v 2 = o, c T v = o, c T y = o, 



1 



(29) 



Thus, to show that all conditions ( fl5i) are satisfied, we need to show that 



Vq A T Z o A Zo V is positive definite, 
V 2 ' A^V z f y *(Ax*) is a vector with positive coordinates. 



31 



Recall that V jf y *{x*) > for all j E Z. 

If the matrix Az. is of full rank, then Z 2 = Z and V 2 T A' Z = I P2 , hence the latter 
condition is satisfied. Otherwise, by Caratheodory's theorem (p. 37 of Bertsekas (2006)), 
3Z 2 C Z such that vectors {A,-,, j E Z 2 } are linearly independent and define the cone 
I w = ^2j e z Aj^j', ftj > o|, i.e. for any j E Z, Aj t can be written a linear combination of 
vectors Aj , j E Z 2 with nonnegative coefficients, in particular, for Z 22 = Z\Z 2 , Az 22) = PAz 2t 
with > 0, i = 1, . . . , \ Z 22 \, j = 1, . . . ,\Z 2 \. Then, vector V 2 A\ ^ zf y * (%*) has positive 
coordinates 

V 2 T A T z VzU (**) = VzJ y *(x*) + (3 T V z Jy*(x*) > 0, 

where the inequality is componentwise. 

By definition of Z , there exist matrices a and a 2 such that A z ^ t = a A Zo , + a 2 A Z2 ,, 
in particular \Z C \ x po matrix ao is of full rank. Therefore, Az^Vq = and the matrix 
Vq A T ZC A Z cVq = a[a is of full rank, hence the first condition is also satisfied. 

Columns of matrix (A^ , A T Z ^ , V±, () are linearly independent and span W, hence, matrix 
Ui can be written as a linear combination of the columns of this matrix, 

Ui = 5koA Zo , + 5 k2 Az 2 , + tffciVi + 5 k3 ( T . 
Conditions ([29]) imply that U x = V? . 

2. Columns of matrix (A% : A T Z2 :V\\ Q are linearly independent and span IR P , hence, 
any matrix V k T can be written as a linear combination of the columns of this matrix, i.e. for 
k = 0,2,3, 

Vk = Al 5 k0 + A T z J k2 + V x 8 kl + C4s, 

and the same holds for Uf . By the conditions (129!) . multiplying the expression for V 3 by 
A Zo ,, A Z2 ,, Vf and C T implies that 5 31 = , 5 33 = 1/||C|| 2 , 5 3m = -{A Zm A T z J- x A Zm ^ 33 for 
m = 0,2, implying V 3 = C/IICI| 2 > where 

^0 = P A * 0i , h,o = (I-P )A T Z2j (Az 2 ,(I-Po)Al 2 )- 1 A Z2 ,(I-Po), C = (I-P2,o)(I-Po)C 
For (k,m) E {(0,2), (2,0)}, multiplying the expression 

v k = A Zki 5 kk + A Zm S km + Vx5 k x + (,5 k3 
by Az k! , A Zmt , V± and ( T and using conditions (|29|) . we obtain 

V k = (I- P m ,<;)A T z JAz k ,(I - P m>( )A T z J-\ 

where P c = CC7IICII 2 and P u = (/ - P ( )A T z JA Zk ,(I - P c )A Zk ]- 1 A Zk ,{I - P c ). 

3, 4. Now we study the image of map S. 
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Since [y*]z* = and x* ^ for all j E S* c , we must have A z *,s* c — since 

= [y*]z* = Az*,S* x *s* + Az* ,5* c ^5*c = ^4z*,S* c ^s*c 

The values of U2(x — x*) = Az 2 ,( x ~ x*) = Az 2 ,s*%s* are positive since Az 2t s* c = and 
Az 2 ,s*x* s * = is the lower boundary point of AX. The KKT conditions imply that for 
j E Z*, [Ax*]j is a boundary point of AX, and therefore A^{x — x*) = Ajx — [Ax*]j > 0, 
since we assumed that is a lower boundary point of AX. 

If P3 0, U 3 (x - x*) = ( T (x - x*) = £§.x s .. If S ^ S* and S is empty then (J* = and 
hence the image of U 3 (x — x*) is zero, a single point. If S is not empty then the image of 
U 3 (x — x*) is [0, oo) since Xj > and Q > for j E S. 

If we can choose Z such that Z fl (Z* \ Z) — and S'* c 7^ 0, values of C/ (£ — #*) = 
Az ,( x ~ x *) can be both positive and negative for x E X since A Zo j 7^ for all j E S* c 
(otherwise matrix A would have a zero column). 

If Z ^ Z* and Z n Z* ^ 0, then for j E Z* n Z , A jtS *c = and the values of 
[C/o]j,( :r — x *) = Aj t s* x s* can only be nonnegative for x E X, since we assumed that AX has 
only lower boundary points, that could only be zeroes. Denote Zq = Z* fl Z and z = \Zq\. 

Values of Ui(x — x*) can be both positive and negative if \U\\^s* c i that is, [Vi]s* c ,i , is 
nonzero for alH E 1, . . . ,p±. This is equivalent to any solution v to the equation (A T : () T v = 
satisfying vs*c 7^ 0. The condition is equivalent to 

= l(A iS *v s *) T : v^Cs*} T + [(A, s .cv s .c) T : 0] r . 

Since A z * t s* c — 0, vs* is a solution of (^4j?*,s* : (s*) Tv s* — 0. The number of linearly indepen- 
dent nonzero solutions v s* is \S*\ — rank^A^* 5 , : (s*)) = \S*\ — Vi— P3 — z > 0. The remaining 
condition is A^vs^ = —A^vs*. Note that since A Zot are linearly independent vectors that 
are not in the range of A T Z ^ and Az*,s* c = 0, this condition is equivalent to Az \z*,s* c Vs* c = 
(using column elimination), and also rank(Az 0t s* c ) + ran k(Az 2 ,S'*) = ran k(^z nz 2 ,) = rank(A) 
and hence r&nk(A Z(h s* c ) — ran k(^z \z*,s* c ) = Po- Hence, the number of linearly independent 
nonzero solutions t>s* c of the above equation is \S* C \ — rank(Az \z*,s* c ) = P — \S*\ — Po- 

Thus, if the number of linearly independent nonzero solutions Vs*<= p — 15**1 — po is equal 
to pi, i.e. if \S*\ = P2 + P3, then the range of U\{x — x*) includes both positive and negative 
values. If p — \S*\ — po < p\ (i.e. \S*\ > P2+P3 + z), then, s = \S* \ — P2 — p 3 rows of [C/i] ; s*c 
are zero (say, the last s rows) (thus, [Ui\ 1:s> s*c = and [Vi]5* C)1:s = 0). Then, for £ E 1 : s, 

[ Wl ]t = [Ui]t t {x - x*) = [Ui]^s*xs* + [Ui]^sA x s*- - x * s *c) = [Ui] t ,s*xs* = [Vi\s*,e x s*- 

Then, if [Vi]sv includes both positive and negative values, the range of [wi]e includes both 
positive and negative values. If [Vijs*^ has only positive or only negative values, then the 
range is either nonnegative or nonpositive. 
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Now we also need to check whether there is a constraint on v k arising from the constraints 
on x. 

Constraints: x E - x\ = x e = a[V ] e , v + 7^]^ v l + a 2 \V 2 }t, v 2 + 7 2 [V 3 } e v 3 > for £ G S*. 
In the limit, the dominating order is 7, hence for £ G S* such that [Vi]^ 7^ 0, the constraints 
imply [Vi]tvi = [Vi]e tl:s [v i] 1:3 > 0. 

For £ £ S* such that [Vi]^ = 0, the dominating order is 7 2 , if c = lima/7 2 = 0, an d 7 2 
and a if c > 0. If c = 0, for £ G 5* such that [V^ 7^ 0, the constraints become [V^i^ > 0. 
Thus, if all nonzero values of [V^ for £ E S* are positive, the constraint is v 3 > 0. However, 
if [V 3 ]s* has both positive and negative values, the constraint implies w 3 must be zero, thus 
we have the degeneracy of the support of v 3 . If c > 0, the same holds for £ G S* such that 
\V 3 ] e ^ and [V ] e , = 0. 

If c > 0, for £ G S* such that [V 3 ]i 7^ 0, the constraints become [V^i^ + c[Vo]^i>o > 0. 

Thus, if c = 0, then V* is defined by 

V* = {(v , Vl , v 2 , v 3 ) G M P0+P1 - S x R*+ P2+P3 : [V k ] Sk y k > 0, fc = 0, . . . , 3} 

where the inequalities are component-wise. If c > 0, then V* is defined by 

V* = {(v ,v 1 ,v 2 ,v 3 ) G F«+^-xlf 2+Ki : c[V ] Sos v +[V 3 ] So3 v 3 > 0k[V k } Sk v k > 0, k = 1,2} 

where 6*03 = So U S 3 and the inequalities are component-wise. 

In particular, the inequality on V\ is only on the last s components: 

[^l]si,(pi-s+l):pi[^l](pi-s+l):pi > 0. 

□ 



A. 2 Upper and lower bounds on the log posterior density 

We give two lemmas that provide random and nonrandom upper and lower bounds on the 
log of the posterior density. 

Lemma 2. Let B s = B 2 (0,5 ) x £ 2 (0,<?i) x 5^(0, 5 2 ) x 5^(0, 5 3 ), B s (x*) = {x G X : 
V^(x — x*) G Bs, and denote 5 + = \ \V\ |oo[^oa/Po + ^ia/Pi + S 2 + 5 3 ]. Denote also 

8a = ^o| I -002 1 1 2,oo + £l| I-B121 1 2,oo + 7T I \B 22 \ loooo + 5 2 5 + ( ^- , 

2 2p 

5 2 K 

8b — ^0 1 1-^031 1 2,oo + ^1 1 1-^13 1 1 2,oo + ^2) I-B23I |oo,oo + I-B33I |oo,oo + 5 + 5 3 — , 

where = ViHVj and = ViBVj , i,j G {0, 1, 2, 3}, q k = ||\4|| 00,00 for k = 2, 3. 
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1. Upper bound. Then, for x G Bg(x*), we have the following upper bound: 
[h y (x) - h y (x*)}/a 2 < (a y + 5 a l) T v 2 + (b + 5 b l) T v 3 + \\hU 2 (v - H^Vh y (x*)/a)\\ 2 /2 



+ \\B{[ 2 Vi \\ 2 2 /2 - — 2 \\H 00 1/2 Vh y (x*)\\ 2 + ||B 10 || W<A 

where B n = B n + 5 1 k b I Pi , H oq = H 0Q + 5 q k a I Po . 

2. Lower bound. For x G B&(x*) and small enough 5 k and v , we have the following 
lower bound: 

[h y (x) - h y (x*)}/a 2 > (a y - 5 a l) T v 2 + (b - 5 b l) T v 3 + \\H^ 2 (v - H^Vh y {x*)/a) || 2 /2 



- ^- 2 \\Hw lz Vh y {x*)\\ 2 - ll^ioll W^, 



r-l/2 



2a 2 

where B n = B n - 5xK B I Pl , H 00 = H m - 5 k a I, 



L Po- 



Proof. Approximate h y (x) by a quadratic function using Taylor decomposition in a neigh- 
bourhood of x*: 



1 



We start with looking at the gradient: 

V/^x*) = P A T(Vf y {x*) + uVg(x*)) + u{I-P A T)Vg(x*). 

Using the properties of the local geometry described in Section 14.31 and the representation 
x = x* + J2l=o v kW k , we have 

3 

(x-x*) T Vf y *(x*) = ^V k w k ] T VfAx*)=wlV?A z ^ z f y *{Ax*) = a T w 2 , 

k=0 

(x-i*) t V/ s (i*) = wZV 2 A z y z f v {Ax*)+wTV Vf v {x*), 
due to tffVf y *(x*) = 0, AVi = and AV 3 = 0. Also, 

3 

(x-x*) T Vg(x*) = J2 w l rV k T Vg(xl=WoVoVg(x*)+wTV?Vg(x*)+wJVi'Vg(x*), 

due to ViVg(x*) = 0. Combining these expressions together, we have 

(x - x*) T \7h y (x*) = WQVj''Vhy(x*) + a y + vw^b. 

Now we bound A o for w = V~ 1 (x — x*) G B$ using Taylor decomposition of h y (x): 
3x c G (x, x*): 



|Aoo(5)| 



?y ijkhy{x c ){xj - x*)(xj - x*)(x k - 

ijk 

d 



dzi 



[(x - x*) T V 2 h y (z)(x - x*)] 
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Note that 

3 

tx - x*) T V 2 hy{z){x - x*) = Yl < V k V 2 h y {z)V jW] 

k,j=0 

3 

= E ^V k T V 2 f y (z)V jWj + v Y. ™lV k T V 2 g(z)V jWr 

fcj=0,2 k,j=0 

Differentiating with respect to z and bounding the third derivatives of f y and g using 
Assumption S, we have that for every i, with high probability, 

\w^V k T V t V 2 f y (z)V jWj \ < C/sH^llill^fclli- 

Then, we can use inequalities ||T4iUfc||i < y^l | VfcU>fc| | 2 < v^H^H 2 ^ or ^ = ^' ^' an< ^ 
||VfcW fc ||i < | |w fc | |i| | Vfcj |oo,oo for k = 2,3. Denote g fc = |]T4||oo,oo for fc = 2, 3. Then, 

3 

Kx-x^fViV 2 ^)^-^*)! < c /3 + Yl ll^llill^lli 

fc,j=0,2 fcj=0 

3 

< 2C /3 [||K,^o|| 2 + ||^ 2 || 2 ] +4z/C 33 El|Vfc^|| 2 

fc=0 

< 2p(C /3 + 2i/C fl3 ) IKH2 + 2g 2 (C /3 + 2uC g3 ) \ \w 2 \ \\ 
+ AvpCgzWunWl + 4vC g3 ql\\w 3 \\l. 

Therefore, using the constants Ka and Kb defined in the lemma, we have 

|A oO)| < -\\x - x*||imax|(x - x*) T ViV 2 h y (z)(x - x*)\ 
6 » 

< ±<f + [(^N^olll + 5 2 g|||^ 2 || 1 )(2C f /3 + 4^C 53 ) + 4^ 3 [p||^i||l + ^Hl^alli]] , 

< y [«a||wo||1 + KsHkllll + «A?l5 2 ||w 2 ||i/p+ ^S?3 5 3|k3||l/P] , 

since = < ||F|UHIi < IMUv^o + + S * + S d = S +- 
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1. The upper bound. Making the change of variables v = S(x — x*), we have 

[h y {x) - hy(x*)]/a 2 < ayV 2 + b T v 3 + ^VQH 00 vo-VQVh y (x i ')/a+^v{B 1 iv 1 
+ y/vv^B 01 V! + 5 + [k a 1 1 v 1 1 2 + k b \ \ v x \ | 2 ] / 2 
+ [<n%V Q + -fViVx] BV 3 T v 3 + ^B 33 v 3 

2 

+ [<jv$V + jv^V! + i 2 vlV 2 ] BV 3 T v 3 + °—vlB 22 v 2 

+ y [^^gi^slbslll/P + KylAll^Hl/p] 

< (b + 5 b l) T v 3 + K + 5 a l) T v 2 
+ 2^ + MY B io«o) r -Bn(«i + V^B^BwVo) 
+ ifo - H^Vh y {x*)/a) T H m {v Q - H-JVh y {x*)/a) 
' >\H~ 1/2 Vh y (x*)\\\ 



2a 2 

since ifjfc = vBjk if at least one of j, A; is 1 or 3. 
Therefore, an upper bound is given by 

[h y (x) - h y (x*)}/a 2 < (b + l5 b ) T v 3 + (a y + l8 a ) T y v 2 + \\H^(v - H^Vh y (x*)/a)\\ 2 /2 

+ \\Bll 2 ( Vl + ^B^B 10 v )\\l/2 - ^- 2 \\H- 1/2 Vh y (x*)\\ 2 . 

2. The lower bound. A similar argument leads to the following lower bound: 

[h y (x) - h y (x*)]/a 2 > (6 - 5 b lfv 3 + (a y - 5 a l) T v 2 + ||## 2 (v - H^V h y (x*) / a) \\ 2 /2 



WBgfa + V^B^B w v )\\ 2 2 /2 - — 2 \\H m l/2 Vh y {x*)\\ 2 . 



□ 



Lemma 3. In the notation of Lemma^ introduce the following events 

Ax = {u : ||Vf \yf Y(u) {x*) - VfAx*)]\\co < 25a}, (30) 
A 2 = {u: \\V T [V 2 f YH (x*)-V 2 f r (x*)]V \\<2K A 5 }. (31) 

Then, on A\ fl A 2 , for x G B$(x*), 
[h y (x) -h y (x*)}/a 2 < (a + 5 A l) T v 2 + (b + 5 b l) T v 3 + v^n 00 v /2-v^Vh y (x*)/a 
+ HBf/Vlll/2 + ||S 10 ||<W^ 



[h v (x)-h v (x*)]/a 3 > (a-S A l) T v 2 + (b-5 b l) T v 3 + v^n 00 v /2-v^Vh y (x*)/a 
- ll^Vll^-ll^oll^/a 2 , 
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where 

6 A = 38 a + v\\V?Vg(x*)\\, 

^oo = ^oo + (3ka<^o + ^A max (-B o))-f, 
^oo = ^oo — (3«a^o — v\ min (B 00 ))I . 

Proof. There are two random leading terms in the expressions of the upper and the lower 
bounds, a y and H 00 . The lower bound has positive (or positive definite) coefficients if 

5 > 0.5/^1 \V?[V 2 f y (x*) - V 2 /^(^)]K,||, S > 0.5z/k^||Soo|| 
6 a > 0.5\\V?\yf y (x*) -Vf y +(x*)}\\, 6 a > 0.5v\\V?Vg(x*)\\. 

The random part of these conditions is satisfied on A\ fl *4. 2 . 
On Ai, 

a y -a = V 2 T [Vf y (x*) + vVg{x*)\ - Vf V/^x*) < (26 a + z/||K 2 T V(?(^)||)l 

where the inequality is componentwise, and similarly a y — a> — (25 a + z/||V2 T Vg(x*)||)l. 
On event A2, Weyl's inequality implies 

Afe(ifoo) > \k(V^A T V 2 fy(x*)AV ) + z/A min (5 00 ) > A fe (fi o) - 2k a 5 + u\ min (B 00 ), 
Afc(-f^oo) < Afc(^oo) + 2k^5q + ^A max (S o), 

where Afc(M) is the fcth largest eigenvalue of matrix M. Applying these inequalities to the 
bounds in Lemma EJ we obtain the statement of the lemma. 

□ 

A. 3 Proofs, Section 15.31 

Proof of TheoremUl Consider a neighbourhood of x*, Bs(x*) = (x* + VB$) D X, where 
B s = B 2 (0,6 ) x £ 2 (0,5i) x 5^(0, 5 2 ) x 5^(0, <J 3 ). Denote u = D~} y V~ 1 (x - x% with the 
Jacobian of this change of variables being J = <j~ Po ~ 2p2r y~'P 1 ~ 2p ' 3 / det(V). 
Denote pk the smallest rate such that for some > 0, as o — > 0, 

P{|A»(x*) - fA**)]\ > M oPo} -> 0, 
P{||Vf [V/y H (x*) - V/^^HU > M lPl } -> 0, 
P{HK. T [V 2 /yH(^) - V 2 /^)]|| > Af 2P2 } 0. 

Due to Assumption C, pk — > as a — > 0. 
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For the rescaled parameter, we use the corresponding neighbourhood B R and its limit 
B R defined by 

B R = [B 2 (0,R )xB 2 (0,R 1 )x[0,R 2 ] P2 x[0,R 3 ]^}nD- 1 V- 1 (X-x*), 
B R = [B 2 (0, R ) x 52(0, R x ) x [0, R 2 ] P2 x [0, R 3 ] P3 \ n V* 

where 

R = So/a, Rx = SJj, R 2 = 5 2 /a 2 , R 3 = 5 3 / 7 2 - 
We assume that 5k are such that 5k — > and Rk — > oo. In addition, we will need conditions 

Ro = o(j/a), R 2 = o( 1 /a 2 ), #3 = 0(1/7), i? 2 = o(l/a), 

and i?2 = o(7 2 /a 2 ) if c > 0. These conditions hold, for instance, with = Ck[— log ex]"* for 
some positive constants Cfc and a&. Under these conditions, C(5) defined in Lemma 0] tends 
to as cr —7- 0, that will be used below. 

The triangle inequality for the total variation norm gives us 

lF5(»-a!*)|y — A* Iky < \\^S{x-x*)\y1b* r — A* Is* 1 1 TV 

+ H/^Ib* — H*\\TV + \\^S(x-x*)\Yl-B* R — ^s(x-x*)\y\\tv, (32) 

where the balls B R are defined above. Here A 4 1b* is a probability measure \i truncated to 
B* R and normalised to be a probability measure. 

If measures a*i, A*2 are absolutely continuous with respect to some measure /i with densities 
/ and g respectively, then the total variation norm can also be written as 



Hi-H2\\tv = 2 / (f-g)+dn, 
J x 



where (x) + = max(x, 0) (Ivan der Vaart 19981) . In each of the summands in the upper bound 
fl32|) . the first measure is absolutely continuous with respect to the second one, so we will 
use this expression to evaluate the total variation norm. 

We start with the distance between the truncations of the rescaled posterior distribution 
and the limit on B R . Introduce additional notation: 

a =a + <5yil, b = b + 5bl, 
a = a — 5^1, b = b — 5^1, 

and //(•; x, f2, B, a, b) is a measure on V = IRpo+p 1 x R^ 2+P3 , such that for z — (zi, z 2 ) 6 V, 

\i{dz\ a, E, (5) = exp { — | |£ 1//2 ,2i| | 2 /2 + z[a — (3 T z 2 } dz. 
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This measure is finite if matrix E is positive definite, a is finite and all components of vector 
(3 are positive. If B v = V = W 0+Pl x R p _? +P3 or B v = B x x B w (0, r 2 ), we have 

P2+P3 

/i(V;a,S,/3) = J] /Jr 1 [det(S)]- 1/2 (27r)( po+P1 )/ 2 exp{a T E- 1 a/2}, 

i=l 

P2+P3 

x 5^(0, r 2 ); a, £,/3) = /i(V; a, £, /3) Tr x a, X" 1 ) J] [1 - exp{-/3 i r 2 }], 

8=1 

where $(B;a, Q) is the measure of B under the Gaussian distribution centred at a with 
covariance matrix Q. 

If B v degenerates to a manifold of a smaller dimension, then we slightly abuse the notation 
and assume that \i has mass 1 on the degenerate part of B v , i.e. we replace the Lebesque 
measure in the definition of \i with the counting measure on the degenerate part. 

By Lemma El on Ai D A 2 for any B x C B s (x*), with B v = D' 1 V' 1 (B X - x*) C we 
have 

/ exp { — [h y (x) — hy(x*)]/(T 2 } dx > J I exp j— 6 T t> 3 — a^t^j 

x exp |-||fio^ || 2 /2 + vlVh y {x*)/a - \\B\{ 2 v x \\l/2 - v^An^i} dv 
= J jj,(B v ] a y , 

where a y = ([Vh y (x*)] T /a, Of, fi = (a T ,b T ) T , E = f 

yV^-Sio #u 
Similarly, using Lemma [31 we obtain an upper bound: 

/ exp { — \h y {x) — h y (x*)]/a 2 } dx < J / exp {— a T v 2 — b T v^ 
Jb x " Jb v 

x exp j - 1 1 n^vo \\ 2 /2 + v^Vh y (x^/a-\\ B 1 1 [ 2 v 1 1 1 2 /2 - ^v^B 0X v x } dv 
= J/j,(B v ;ay,Y,,j3), 

where (3 = (a T ,b T ) T and £ = ( 00 

\^fvBxQ B xx 

To simplify the notation, denote 

p(dv) = fj,(dv; a y , £, (3), Jl(dv) = p,(dv; a y , E, 

Define event ^3: 

A 3 ={co: ||fiooOoH -<7/ 7 2 V Q T V<?«) -^V^Vfv^ix^Woo < p} 

where p is the smallest value such that P(^4.3) — > as a — > 0. On A3, ||V/iy( w )(a;*)/c7 — 
fioo a o(w)||oo < P- Therefore, on A3, measure Jl is finite since a y is finite, and all other 
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parameters are positive or positive definite. Measure \i is finite on A3 if 5b < b min , 5a < a m m, 
5\ *C A m i n (B u )/k b and 5 < A min (f2 o)/(3/tA)- 

Hence, the posterior density of S(x — x*) normalised by the posterior measure of B R is 
bounded on A\ fl A 2 by 

Jx{dv) dp(S(x — x*) I Y) ^ p,(dv) 



jl(B R ) 



p(B R I Y) ~ ji(B$ • 



Therefore, the total variation distance between the rescaled posterior distribution and its 
limit, both truncated to B R , is bounded on Ai fl A2 by 



J 5(x-x*)|yls* — ||tv < 2 / 



< 2 



- 1 



li*(dv) 



Now, = ggfrgffj where a* = (a^ o,0) T , 0* = (a T ,6 T f , E* = diag(Q 00 , S u ). 

Denote Ho(dv) = /i(dv] E*, a*, (3*). Then, with v i = (w^,wf) T , 



A*o(du) 
which implies 



exp{5 A l T t; 2 + <U T ^3 + %(E* ~ S )W 2 + - a*)}, 



exp{(v 01 + (E* - E)- 1 ^, - a*)) T (E* - E)(uoi + (E* - E)- 1 ^ - «*))/2} 



x exp{<5 A l T t; 2 + <5 b l T ^ - (% - a*) T (E* - - a*)/2} 

/ B * exp{-a T w 2 - 6 T w 3 }exp{-||E* 1/2 t;oi|| 2 /2 + v^a^dv 



x 



exp{-(a + 5 A 1)^ 2 - (b + 5 b l) T v 3 } exp{-| |SV2 Uoi ||2/ 2 + v^a y }di 



To show that this expression is greater than 1, it is sufficient to show that for any B C {v 01 
(voi,V2,vs) G B R }, the following expression is positive: 

e -(a„-a*) r (E*-£)-i(a B - a *)/2 /* e -||E* 1 /2 z ||2 /2 +^ q * } ^_ /b e xp{-||EV2 z ||2/2+^a !/ ^ 



3 -||El/2 2 ||2 /2+z T Qa 



'X 



x e 



which is indeed the case since 



z T (T,-T,*)z/2+z T (a* -a y )+(a* -a y ) T (S-S 



*) _1 («*-"!/)/ 2 _ i] dz > 







E - E* = diag ([3k a ^o + vKm X (B 00 )]I P0 , S^bI^) , 
(E - E*) - (E - E*) = E — E = diag ([Qk a 5 + u(X max (B 00 ) - A min (5 00 ))]/ po , 26^1^) 
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are positive definite matrices. 



Therefore, on A\ fl A 2 , 



fMdv) Mo(-Bfl) 



(i (dv) v(b r ) 
^S{x-x*)\Y^-B R — K^B r \\TV < 2 



> 1 and hence 



fi(dv) /i*(-B 



R) 



H(B* R ) fj,*(dv) 



H*(dv) 
K(B_ 



R) 



fi(B 



R) 



1KB 



R) 



Since u J (dv)/dv > u J (dv)/dv for all v G B R , KBr) — KBr) < KBr) — h(Br), and thus 



KB 



R) 



1 < 



KBr) - K b r) K b r) - K b r) 



KB R ) KBr) 

The difference of measures of Br is bounded by 



KB 



R 1 



KBr) - K b r) 



j-zfi;z 1 /2+z'[ay-l3 T Z2 



exp { zl\E - E)zi/2 + (p - (3) T z 2 



dz 



< 
< 



z((t - E) Zl /2 + {13- 0) T z 2 \ e -4^/^i^-^^ dz 
(27r)^ + ^)/ 2 [det(S)]- 1 / 2 e< 2 ~ 1 ^/ 2 J]^- 1 



x 



IKE - Hf'^^ayW 2 + trace(5r 1 (£ - £)) 



25 A p 2 25 b p 3 



+ 



min, a, min, h 



< (2 7 r)(«' + ^)/ 2 [det(S)]- 1 / 2 e Q " S " 1 ^/ 2 n4" 1 



d i[||S CKy| I +trace(S )J H : — — H r 



minj aj minj h L 
due to inequality e a ' — 1 < xe x for x > 0, where 

5 01 = max([6K A S + v{X max {B 00 ) - \ min (B 00 ))], 25 1 k b ). 

Therefore, 

\\V s{x - x * W 1b r -K1b* r \\tv < 2[/I(^)]- 1 (27r)^ + ^)/ 2 [det(S)]- 1 / 2 e< 2 - 1 ^/ 2 nA~ 1 



x 



x rnv-i l|2_L+ ^v-im 1 2( ^ 2 2( ^ 3 
doi[||S CK y | I + trace(2J )] H — + 



&min — ^6 



which goes to zero since 6k — > as a — > 0. 

The total variation distance between the posterior distribution truncated to Br and to 
B R is bounded by 

||Ps(a;-a:*)|KlB^ ~ ^S(x-x*)\Y~1-B r \\tV < ^S{x-x*)\Y^-B r ]{Br\B r ) 



< 



2u jL (B r \ B R ) max v€ B R [Kdv)/di 
KBr) 



where ^l(B) is the Lebesgue measure of B. By Lemma III [i L (B R \ B R ) — )■ as a — > 0. 
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For the measure a, E, ft), max zeV [fi(dz; a, E, ft)/dz] = exp{a T E 1 a/2}. Therefore, 



m & x veBR [fi(dv)/dv} „Ts-i(E-2)2- laa/2 P ?f 3 R. bWV^ 1/2 



?(V) 



II A [det(E)] (27 

The difference E — E = O(max(5o, Si)) — > 0, on A3, E _1 a y and E -1 *^ are bounded as well 
as ft. Now we need to show that the last factor is bounded. 

Choose a small enough so that B R = B(0,R ) x B(0, R x ) x [0, R 2 } P2 x [O,^ 3 . This 
condition is satisfied if ||C4x*|| > 5k for k = 0,1 and ||C/fcZ*||oo > $k for k = 2,3. If the 
degeneration of the support takes place, the degenerate dimensions are excluded. Due to 
5(0, ^R 2 + R\) C B(0, Ro) x B{0,Rt) and to inequality 1 - $(£?(0, R); a, E _1 ) < 1 - 
T^«|f),wehave 

KBr) / UgKyg + R\ - IIE- 1 ^!!) 2 go+ p/ 

ju(V) - ^ 2 1 2 

P2 P3 



£[[1 _ exp{-a 4J R 2 }] - exp{-kR 3 }} 



x 

j=l i=l 



which is close to 1 for large R k , k = 0, 1,2,3. Therefore, \\¥s(x-x*)\y1b r -^s(x-x*)\y1b r \\tv -» 
as a 0. 

The total variation distance between the limit measure and its truncation to B R is 
bounded by 

||//* - ^*1b* ||tv < 2fj,*(B* R ) = 2^(B R ) + 2(i*(B R \ B* R ) 

P2+P3 

< 2f, L (B R \B R )[det(^)} 1/2 (27r)-(^)/ 2 J] /?* 



i=l 

{Q 00 )(R - ||a ||) 2 , Po\ ^ -n / ^miri 

(Bn)R{ pi 



2 1 2 7 V 2 1 2 

/'J P3 

X 

i=l i=l 



J|[l _ expj-a^}] - exp{-6^ 3 }] -> 



as a —7- 0, since i4 — >■ 00 and Hl(B r \ B R ) — >• as a — > by Lemma @] where is the 
complement of B R . 

The total variation distance between the posterior distribution and its truncation to B R 
is bounded by 

\\1P(S(x-x*)\Y)1b r -~^(S(x-x*)\Y)\\TV < 2^(S(x-x*)\Y)(B R ) 

2 Ix\b s (x*) exp{-(/i y (x) - h y (x*))/(r 2 } dx 
f x exp{ — (h y (x) — h y (x*)) I a 2 } dx 
2det(V)[fi(B R )}- 1 A (6) 



< 



l + det(V)[Jl(B R )]^A (5y 
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where 

A (5) = a P0+2p2 j pl+2p3 [ exp{-[h y (x)-h y (x*)]/a 2 }dx. 

JX\B s (x*) 

By Assumption L, with probability — >■ 0, A (5) — > as cr — y 0, and Jjl(B r ) — )■ fi (B R ) > 0. 
Combining these bounds, we have that 

||Ps (rf )|y-/i1lTV < 2//(£ fl ) + 2 det(V0A (5) 

+ 2[ / S(5 i? )]" 1 (27r)( po+ P 1 )/ 2 [det(S)]- 1 / 2 e^ f: ' la! ' /2 JJ 4" 1 



x 



doil||S a„|| +trace(E )J H — — + — — - 



+ 2fM L (B R \B* R ) 



r *(a ma 1j max 1 , eSji [/i(dw)/c?t;] 
max[/i {avj /av\ + 



as a — > 0, which gives the statement of the theorem. 



□ 



Lemma 4. Under conditions of TheoremUl fi L (D X V 1 (Bs(x*) — x*) \ V*) < C(5) det(V x ) 
where 



C{8) = [||Vo||oo,2^o/7 + ||^2||oo,oo<5 2 /7 + H^IIoo^/t]'^^ 



I^IUooV^P 01 , C = 0, 

|^||oo,oc5 2 /7 2 ] |50| + |S31 , C>0, 



sets Sk are defined in Proposition^ and fii is the Lebesgue measure. 

Proof. We study the constraints on Vk under D^ 1 V^ l {Bs{x i ') — x*) \ V* using notation from 
the proof of Theorem [TJ 

We find the Lebesgue measure of VB R \ (VV*), then the Lebesgue measure of B R \ V* is 
&et(y- l )n L (yB R \ (VV*)). For £ G S u 

\Yx\iV-l > -(x/n/[Vo]e,Vo - o- 2 /7[V 2 k v 2 - l[V 3 ] e v 3 and [Vi]^t>i < 0. 

which is a subset of — -Ro^/ T 1 1 [Mj]^, 1 1 2 - ^VtII^UIU _ lR 3 \V 3 ] t < [V^Vi < 0. This set 
could be empty, or its Lebesgue measure could be up to [Rqct/jI \V \ + a 2 R 2 j^\ \V 2 \ 1 00,00 + 
7i?3[V3] 00 ]' ,Sl '. If Ro is chosen in such a way that Rqo/^ — > 0, a 2 R 2 /'j -» and 7-R3 — > 
(which is possible), the measure of this set tends to 0. 

For £ G S3 and c = limcr/7 2 = 0, > [V^"^ > — [VoU v o°~/l 2 - This set is empty, since 
either p 3 = 0, or p 3 = 1 and v 3 > and [V 3 ]i > 0. 

For £ G S 03 ando 0, > [V 3 ] £ v 3 + [V ] e> v > -[V 2 ] t , v 2 a 2 / 1 2 > -\ \V 2 \ \ ^ R 2 a 2 h 2 . This 
set is either empty, or its Lebesgue measure is at most [| | V 2 \ |oo i00 R 2 a 2 / i y 2 f s °^ s ^ provided 
R 2 a 2 h 2 -> 0. 

For £ G So and c = limcr/7 2 = 0, > [Vo^ito — — [^2]^ > — HV2II 00,00 RqP- This set is 
either empty, or its Lebesgue measure is at most [| | V 2 \ |oo,oo R 2 cr}^ provided R 2 a — > 0. 
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For £ G S* c such that [Vi]^ 7^ 0, the constraints are 



[Vi}ivi > -Xi/j + a/j[Vo)iv > -x\h + o-/j[Vo]eVo, 

and [Voj^wo > —x\ja if [Vi]e, = 0. Since there is no constraints on [Vi^^i and Vo]^i>o for 
£ G S* c , the difference V s *cB R \ (VV) is empty. 

Therefore, the Lebesgue measure of VBr \ (VV*) is at most 

[iW7lNU,2 + ^ 2 /7ll^||oo,oo + 7^3||^ 3 ||oo] 1511 + [11^1100,00 i? 2 a 2 /7 2 ] |5o|+|53 ' 
if c > 0, and is at most 



if c = 0. 



□ 



A. 4 Proof of Theorem on Bayes estimates 

Proof of Theorem^ The limit Bayes estimate Vq is measurable by the Jennrich's measur- 
ability theorem since it minimises the objective function that is continuous in data and 
parameters. 



To prove the theorem, we follow Chernozhukov and Hong (2004) and apply Theorem 



1.10.2 of Ibragimov and Has'minskij (1981) (p. 107), which allows one to obtain the limit 
distribution of the Bayes estimates provided the following conditions on the penalised likeli- 
hood ratio process £ T (v) = exp{ — (h y (x* + S^v) — h y (x*))/r} are satisfied. The results of 



Theorem 1.10.2 of Ibragimov and Has'minskij (1981) and the auxiliary lemmas apply since 



the factor exp{ — (g(x* + S x v) — gix*))/'-/ 2 } is non-random and is bounded on a compact 
neighbourhood of 0. 

1 /2 

1. Holder continuity of £ T (v) in the mean square, and the exponential bound on the 
expected tail of £ T (v). The first condition is that for any compact K C X 3C\,C2 that 
depend on K such that for any v,v' G K, \\v ||oo, H^'Hoo < R, 



E\£ T {v) 1/2 - £ T {v'Y' 2 \ 2 < d(l + R Cr )\\v - v'\ 



as 
oo 



for some a G (0,1]. 

The second condition is that any compact K C X 3g T (z) : [0, oo) — >■ (0, oo) such 
that for any fixed r, q T (z) increases to infinity as z increases to oo, and for any iV G N, 
\im T ^ z ^ 00 z N e- q ^ z) = 0, so that for all for all v G S~ 1 (X - x*), 

E£ T (v) 1/2 < e - qAM °°\ 
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These conditions are checked below. 

2. Finite-dimensional convergence of £ T (v) to the density of /x* is satisfied (Theorem [TJ). 

3. The limit Bayes problem, 



is uniquely solved by a random vector Vq. This condition is satisfied since Q is convex with 
a unique minimum, and fi* is a proper probability measure. In fact, this weaker condition 
on Q can replace the convexity condition. 

4. Conditions on the loss functions Q are satisfied. 

Now we check the first condition, using notation defined in the proof of Theorem [TJ By 
Lemma [3j for v G B R , on A = Ai fl A 2 fl A3, 



e T (vf 2 = exp{-{h Y {x* + S-'v) -h Y {x k ))/{2r)} 

< exp{-(a min - 5 A )\\v 2 \\ 1 /2 - (6 min - 5 b )\\v 3 \\ 1 /2 

- Km (fioo) I \v Q \ 11/4 - A min (B u ) 1 1 % 1 11/4 

+ ||uo||a (||fiooOo(w)|| +p)/2 + v^l l^iol 1 1,1 1 M U I Kl U/2}. 

Then, 

M T (v)^ 2 < expf-^dl^lU)}?^) + 1 - P(.A) = expf-^anU }(1 + 

due to P(.A) = P(A n A 2 n *4 3 ) ->■ 1 as r -> 0, where 

q T (z) = [a min - S A + b miQ - 5 b - ^(\\n 00 a (uj)\\+ p)]z/2 
+ [A min (fioo) + Kin(B n ) - 2v^||5 10 ||i,i]z 2 /4 

satisfies the required conditions for r small enough. 

The second part: by Lemma [31 using both upper and lower bounds on the log posterior, 
for v, v' G K C Br for a compact K, on ^4 = A\ fl ^2 H .A3, we have 

log {i T {v)/E T {v')) < -(a - 8 A l) T v 2 + (a + 5 A l) T v' 2 - (b - 8 b l) T v 3 + (b - 5 b lfv' 3 



- v^n 00 v /2 + v' T n 00 v' /2 - vlB ll v 1 /2 + v?B lx rfj2 

+ VqHqoXo - VqHqqXq + y/uV^BwVo - y/vV f B 10 v' 

< \\a\\i\\v 2 - Vzlloo + 6a\\v 2 + v'tllx + \ \b\\x\\v3 - v'zlloo + 8 b \\v 3 + v' 3 \\ 1 

- (v + v' ) T ^oo(v - v' )/2 + v?(n 00 - QooK/2 

- {v x + v'.fBniv, - v[)/2 + v[ T (B n - B n )v[/2 

+ (vq - v' ) T Vh y (x*) + - v[) T B 1Q v Q + yfiv?B 10 (vo - v' ). 



inf / 

• gRPO+Pl xK P2+P3 Jrpo+pi xM p j ? +P3 
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Thus, we can write 



k=0 



Taking 5k such that 5k — > and 5kRk — > 0, the last four terms in the exponent tend to as 
a — > 0. Therefore, using inequality e x — 1 < xe x for x > 0, for k — 0, . . . , 3, on A, 

|4(t>) 1/2 -4-(t/) 1/2 | < 4K) 1/2 (co(l+i2o + i2i)l|w'Hoo + o(l)) 

< (c (l + i? + i2i)||u - u'Hoo + o(l)) expl^dlu'Hoo)}. 

This implies that E | i T {y) x l' 2 —l T (i/) 1//2 \ < C| |t>— 1>'| |oo + o(l). Thus, conditions 1-4 of Theorem 
1.10.2 of Ibragimov and Has'minskij (1981) are satisfied and hence S(xq — x*) -4 Vq. 

□ 



A. 5 Auxiliary results 

Lemma 5. Let S be a nonempty subset of {1 : p} and denote = {x G : Xj > OVi G S 1 }. 

Then, a linear map M.$ — )■ Ms defined by matrix V is a bisection if and only if Vs,n(S) — 
diag(ai, . . . , a|s|) for some > and Ks,s c = /or some permutation it of S. 

Note that this statement also holds for V^ 1 . 

Proof of Lemma{5i We need to show that (Vx)k > iff Xk > for each k E S. Then, we 
need to show that 

V s ,sx s + V s ,s°xs° e [0, oo) 151 iff x s G [0, oo) 151 

which holds iff Vs,s c — and Vs,s is an invertible matrix with nonnegative entries. Denoting 
U = V^ 1 , these conditions imply that Us 7 s — [Ks.s 1 ] -1 and Us t s c = 0. 

Now, matrix V" 1 must satisfy the same conditions, i.e. Us,s° — (which is satisfied) and 
[V -1 ]^^ = [Ks 5 5] _1 is an invertible matrix with nonnegative entries. 

Thus, we must have that both Vs,s an d tys,s] _1 have nonnegative entries. We can prove, 
for instance, by induction, that this implies that both Vs t s an d [Vg^]" 1 have to be diagonal 
matrices with positive eigenvalues, up to a permutation of the coordinates in S. For \S\ = 2 
it is a necessary and sufficient condition (base of the induction). Suppose the statement is 
true for |,S| = m; then, for \S\ = m + 1, nonnegativity of the matrices' elements implies 
that both matrices can be written in the block form of sizes m and 1. For each matrix, the 
block of size one must be a positive number, since the matrices are invertible, and the block 
of size m must be a diagonal matrix with positive eigenvalues, up to a permutation of the 
coordinates. This implies the statement of the lemma. 

□ 
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